The following topics are covered in this chapter:

    Choosing a Block Size
    Choosing an Intent Log Size
    Choosing Mount Options
    Kernel Tuneables
    Monitoring Free Space
    Tuning I/O
Choosing a Block Size

The ufs file system defaults to a 4K block size with a 1K fragment size. This means that space is allocated to small files (up to 4K) in 1K increments. Allocations for larger files are done in whole 4K blocks, except for the last block, which may be a fragment. Since many files are small, the fragment facility saves a large amount of space compared to allocating space 4K at a time.

The unit of allocation in VxFS is a block. There are no fragments, since storage is allocated in extents that consist of one or more blocks. For the most efficient space utilization, the smallest block size available on the system should be used. Typically, this provides the best performance as well. The smallest block size available is 1K, which is also the default block size for VxFS file systems created on the system. Unless there are special concerns, there should never be a need to specify a block size when creating file systems.
For large file systems with relatively few files, the system administrator may wish to experiment with larger block sizes. Specifying a larger block size decreases the amount of space used to hold the free extent bitmaps for each allocation unit, increases the maximum extent size, and decreases the number of extents used per file; the trade-off is an increase in the amount of space wasted at the end of files that are not a multiple of the block size. In short, larger block sizes use less disk space in file system overhead, but consume more space for files that are not a multiple of the block size.
Overall file system performance may be improved or degraded by changing the block size. For most applications, it is recommended that the default values for the system be used. However, certain applications may benefit from a larger block size. The easiest way to judge which block sizes will provide the greatest system efficiency is to try representative system loads against various sizes and pick the fastest.
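If a nondefault block size is desired, it can be specified when the file system is created. The following is a sketch only; the device name and size in sectors are placeholders:

mkfs -F vxfs -o bsize=4096 /dev/rdsk/c1b0t3d0s1 1024000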
Choosing an Intent Log Size

The mkfs utility uses a default intent log size of 1024 blocks. The default size is sufficient for most workloads. If the system is used as an NFS server or for intensive synchronous write workloads, performance may be improved using a larger log size.

There are several system performance benchmark suites for which VxFS performs better with larger log sizes. The best way to pick the log size is to try representative system loads against various sizes and pick the fastest.
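A larger log can likewise be chosen at file system creation time. In this sketch the log size is given in file system blocks, and the device and size are placeholders:

mkfs -F vxfs -o logsize=2048 /dev/rdsk/c1b0t3d0s1 1024000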
Choosing Mount Options

In addition to the default mount mode (log mode), VxFS provides the blkclear, delaylog, tmplog, nolog, and nodatainlog modes of operation. Caching behavior can be altered with the mincache option, and the behavior of O_SYNC writes (see the fcntl(2) manual page) can be altered with the convosync option.
The delaylog, tmplog, and nolog modes are capable of significantly improving performance. The improvement over log mode is typically about 15 to 20 percent with delaylog; with tmplog and nolog, the improvement is even higher. Performance improvement varies, depending on the operations being performed and the workload. Read/write intensive loads should show less improvement, while file system structure intensive loads (such as mkdir, create, and rename) may show over 100 percent improvement. The best way to select a mode is to test representative system loads against the logging modes and compare the performance results.
Most of the modes can be used in combination. For example, a desktop machine might use both the blkclear and mincache=closesync modes.
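Such a desktop configuration might be mounted as follows (a sketch; the device and mount point are placeholders):

mount -F vxfs -o blkclear,mincache=closesync /dev/dsk/c1b0t3d0s1 /mnt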
Additional information on mount options can be found in the mount_vxfs(1M) manual page.
log

With log mode, VxFS guarantees that all structural changes to the file system have been logged on disk when the system call returns. If a system failure occurs, fsck replays recent changes so that they will not be lost.
delaylog

In delaylog mode, some system calls return before the intent log is written. This logging delay improves the performance of the system, but some changes are not guaranteed until a short time after the system call returns, when the intent log is written. If a system failure occurs, recent changes may be lost. This mode approximates traditional UNIX guarantees for correctness in case of system failures. Fast file system recovery works with this mode.
tmplog

In tmplog mode, intent logging is almost always delayed. This greatly improves performance, but recent changes may disappear if the system crashes. This mode is only recommended for temporary file systems. Fast file system recovery works with this mode.
nolog

In nolog mode, the intent log is disabled. All I/O requests, including synchronous I/O, are performed asynchronously. Unlike the other logging modes, nolog does not provide fast file system recovery. With nolog mode, a full structural check must be performed after a crash; this may result in loss of substantial portions of the file system, depending upon activity at the time of the crash. Usually, a nolog file system should be rebuilt with mkfs after a crash. The nolog mode should only be used for memory resident or very temporary file systems.
Use extreme caution with the nolog option, as it can cause files to be lost even if those files were written synchronously. With nolog, the potential loss after a crash is greater than with ufs.
nodatainlog

The nodatainlog mode should be used on systems with disks that do not support bad block revectoring.
Normally, a VxFS file system uses the intent log for synchronous
writes. The inode update and the data are both
logged in the transaction, so a synchronous write only requires one
disk write instead of two. When the synchronous write returns to the
application, the file system has told the application that the data is
already written. If a disk error causes the data update to fail, then
the file must be marked bad and the entire file is lost.
If a disk supports bad block revectoring, then a failure on the data
update is unlikely, so logging synchronous writes should be allowed. If
the disk does not support bad block revectoring, then a failure is more
likely, so the nodatainlog
mode should be used.
A nodatainlog
mode file system should be approximately 50
percent slower than a standard mode VxFS file system for synchronous
writes. Other operations are not affected.
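Like the other logging modes, nodatainlog is selected at mount time (a sketch; the device and mount point are placeholders):

mount -F vxfs -o nodatainlog /dev/dsk/c1b0t3d0s1 /mnt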
blkclear

The blkclear mode is used in increased data security environments. The blkclear mode guarantees that uninitialized storage never appears in files. The increased integrity is provided by clearing extents on disk when they are allocated within a file. Extending writes are not affected by this mode. A blkclear mode file system should be approximately 10 percent slower than a standard mode VxFS file system, depending on the workload.
mincache

The mincache mode has five suboptions:
mincache=closesync
mincache=direct
mincache=dsync
mincache=unbuffered
mincache=tmpcache
The mincache=closesync mode is useful in desktop environments where users are likely to shut off the power on the machine without halting it first. In this mode, any changes to the file are flushed to disk when the file is closed.
To improve performance, most file systems do not synchronously update
data and inode changes to disk. If the system crashes, files that have
been updated within the past minute are in danger of losing data. With
the mincache=closesync
mode, if the system crashes or is
switched off, only files that are currently open can lose data. A
mincache=closesync
mode file system should be
approximately 15 percent slower than a standard mode VxFS file system,
depending on the workload.
The mincache=direct, mincache=unbuffered, and mincache=dsync modes are used in environments where applications are experiencing reliability problems caused by the kernel buffering of I/O and delayed flushing of non-synchronous I/O. The mincache=direct and mincache=unbuffered modes guarantee that all non-synchronous I/O requests to files will be handled as if the VX_DIRECT or VX_UNBUFFERED caching advisories had been specified. The mincache=dsync mode guarantees that all non-synchronous I/O requests to files will be handled as if the VX_DSYNC caching advisory had been specified. Refer to the vxfsio(7) manual page for explanations of VX_DIRECT, VX_UNBUFFERED, and VX_DSYNC. The mincache=direct, mincache=unbuffered, and mincache=dsync modes also flush file data on close, as mincache=closesync does.
Since the mincache=direct, mincache=unbuffered, and mincache=dsync modes change non-synchronous I/O to synchronous I/O, there can be a substantial degradation in throughput for small to medium size files for most applications. Since the VX_DIRECT and VX_UNBUFFERED advisories do not allow any caching of data, applications that would normally benefit from caching for reads will usually experience less degradation with the mincache=dsync mode. The mincache=direct and mincache=unbuffered modes require significantly less CPU time than buffered I/O.
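A file system supporting such applications might therefore be mounted as follows (a sketch; the device and mount point are placeholders):

mount -F vxfs -o mincache=dsync /dev/dsk/c1b0t3d0s1 /mnt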
If performance is more important than data integrity, the
mincache=tmpcache
mode may be used. The
mincache=tmpcache
mode disables special delayed extending
write handling, trading off less integrity for better performance.
Unlike the other mincache
modes, tmpcache
does not flush the file to disk when it is closed. When this option is
used, garbage may appear in a file that was being extended when a crash
occurred.
convosync

The convosync (convert O_SYNC) mode has five suboptions: convosync=closesync, convosync=direct, convosync=dsync, convosync=unbuffered, and convosync=delay.

Note that the convosync=dsync option violates POSIX guarantees for synchronous I/O.
The convosync=closesync
mode converts synchronous and data
synchronous writes to non-synchronous writes and flushes the changes to
the file to disk when the file is closed.
The convosync=delay
mode causes synchronous and data
synchronous writes to be delayed rather than to take effect
immediately. No special action is performed when closing a file. This
option effectively cancels any data integrity guarantees normally
provided by opening a file with O_SYNC
.
See the
open(2),
fcntl(2),
and
vxfsio(7)
manual pages for more information on O_SYNC
.
Use extreme caution when using the convosync=closesync or convosync=delay modes, as they actually change synchronous I/O into non-synchronous I/O. This may cause applications that use synchronous I/O for data reliability to fail if the system crashes and synchronously written data is lost.
The convosync=direct and convosync=unbuffered modes convert synchronous and data synchronous reads and writes to direct reads and writes.
The convosync=dsync
mode converts synchronous writes to
data synchronous writes.
As with closesync, the direct, unbuffered, and dsync modes flush changes to the file to disk when it is closed. These modes can be used to speed up applications that use synchronous I/O. Many applications that are concerned with data integrity specify the O_SYNC flag in order to write the file data synchronously. However, this has the undesirable side effect of updating inode times and therefore slowing down performance. The convosync=dsync, convosync=unbuffered, and convosync=direct modes alleviate this problem by allowing applications to take advantage of synchronous writes without modifying inode times as well.
Before using convosync=dsync, convosync=unbuffered, or convosync=direct, make sure that all applications that use the file system do not require synchronous inode time updates for O_SYNC writes.
Combining mount Options

Although mount options can be combined arbitrarily, some combinations do not make sense. For example, the mount option combination:
mount -F vxfs -o nolog,blkclear
disables intent logging so that a full fsck
is required
after a system failure, yet it incurs the performance overhead of
clearing each extent for a sparse file before allocating the extent to
the file. Without intent logging, there is no guarantee of data
integrity after a crash, so the clearing of extents allocated to sparse
files is a waste of time.
The following examples provide some common and reasonable
mount
option combinations.
mount -F vxfs -o log,mincache=closesync /dev/dsk/c1b0t3d0s1 /mnt
This guarantees that when a file is closed, its data is synchronized to disk and cannot be lost. Thus, once an application exits and its files are closed, no data will be lost even if the system is immediately turned off.

mount -F vxfs -o tmplog,convosync=delay,mincache=tmpcache \
/dev/dsk/c1b0t3d0s1 /mnt

This combination might be used for a temporary file system where performance is more important than absolute data integrity. Any O_SYNC writes are performed as delayed writes, and delayed extending writes are not handled specially (which could result in a file that contains garbage if the system crashes at the wrong time). Any file written 30 seconds or so before a crash may contain garbage or be missing if this mount combination is in effect. However, such a file system does significantly fewer disk writes than a log file system, and should have significantly better performance, depending on the application.
mount -F vxfs -o log,convosync=dsync /dev/dsk/c1b0t3d0s1 /mnt
This combination would be used to improve the performance of applications that perform O_SYNC writes, but only require data synchronous write semantics. Their performance can be significantly improved if the file system is mounted using convosync=dsync without any loss of data integrity.

Kernel Tuneables

There are several kernel tuneables that affect the performance of the vxfs file system. These tuneables are discussed in the following sections.
The kernel tuneables are found in files located in /etc/conf/mtune.d. The mtune.d files contain the system default, minimum, and maximum values for each tuneable. The /etc/conf/cf.d/stune file contains entries for tuneables that cannot use the default value in the mtune.d file. The stune file is used to change tuneable values to something other than the default, when necessary.
For the s5, sfs, ufs, and vxfs file system types, inodes are cached in a "per file system table," known as the inode table. Each file system type has a tuneable to determine the maximum number of entries in its inode table. For the s5 file system, the tuneable is NINODE. For ufs and sfs, the tuneable is SFSNINODE.
The vxfs
file system type (FSType) uses the tuneable
VXFSNINODE
as the maximum number of entries in the
vxfs
inode table. The actual size
of the inode table is dynamically adjusted as the system activity
changes. The VXFSNINODE
parameter is the upper bound on
the size of the inode table.
The value of VXFSNINODE
is determined automatically at
boot time, based on the amount of memory in the system. The value for
VXFSNINODE
should generally be left alone and not altered.
The current size of the table and number of inodes in use can be
monitored using sar
with the -t
option.
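For example, the following invocation samples the table five times at ten-second intervals (a sketch; the interval and count are arbitrary):

sar -t 10 5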
If the VXFSNINODE
value is too small, the system may run
out of inode-table entries, and system calls may fail with
ENFILE
. If the value is too large, excessive memory may be
consumed by inode-table entries, which would adversely affect system
performance. Before overriding the automatically determined value by
specifying a positive value for this tuneable, the system administrator
should verify that the problem prompting this change is not due to an
application that forgets to close files.
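If an override does prove necessary, the usual SVR4 mechanism applies: place an entry in /etc/conf/cf.d/stune (typically with the idtune utility), rebuild the kernel, and reboot. A sketch, with an illustrative value only:

/etc/conf/bin/idtune VXFSNINODE 16384
/etc/conf/bin/idbuild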
Monitoring Free Space

Using the df command to monitor free space is desirable. Full file systems may have an adverse effect on file system performance. Full file systems should therefore have some files removed, or should be expanded (see the fsadm_vxfs(1M) manual page for a description of online file system expansion).

Fragmentation reduces performance and availability. Regular use of fsadm's fragmentation reporting and reorganization facilities is therefore advisable.
The easiest way to ensure that fragmentation does not become a problem is to schedule regular defragmentation runs from cron. Defragmentation scheduling should range from weekly (for frequently used file systems) to monthly (for infrequently used file systems). Extent fragmentation should be monitored with fsadm or the -o s option of df. There are three factors which can be used to determine the degree of fragmentation:

    the percentage of free space in extents of less than 8 blocks in length
    the percentage of free space in extents of less than 64 blocks in length
    the percentage of free space in extents of 64 or more blocks in length

The optimal period between defragmentation runs can be determined by choosing a reasonable initial interval, scheduling fsadm runs at that interval, and running the extent fragmentation report feature of fsadm before and after the reorganization.
The "before" result is the degree of fragmentation prior to the
reorganization. If the degree of fragmentation is approaching the
figures for bad fragmentation, then the interval between
fsadm
runs should be reduced. If the degree of
fragmentation is low, the interval between fsadm
runs can
be increased.
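The current degree of fragmentation can be checked with either interface (a sketch; the mount point is a placeholder):

df -F vxfs -o s /home
/etc/fs/vxfs/fsadm -E /home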
The "after" result is an indication of how well the reorganizer is performing. If the degree of fragmentation is not close to the characteristics of an unfragmented file system, then the extent reorganizer is not functioning properly. The file system may be a candidate for expansion. (Full file systems tend to fragment and are difficult to defragment.) It is also possible that the reorganization is not being performed at a time during which the file system in question is relatively idle.
Directory reorganization is not nearly as critical as extent
reorganization, but
regular directory reorganization will improve performance. It is
advisable to schedule directory reorganization for file systems when
the extent reorganization is scheduled. The following is a sample
script that is run periodically at 3:00 A.M.
from cron
for a number of file systems:
outfile=/usr/spool/fsadm/out.`/bin/date +'%m%d'`
for i in /home /home2 /project /db
do
    /bin/echo "Reorganizing $i"
    /bin/timex /etc/fs/vxfs/fsadm -e -E -s $i
    /bin/timex /etc/fs/vxfs/fsadm -s -d -D $i
done > $outfile 2>&1
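Such a script might be invoked weekly from a crontab entry along these lines (a sketch; the script path is a placeholder):

0 3 * * 6 /usr/local/bin/fsadm_reorg.sh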
Tuning I/O

If the VxFS file system is being used with the VERITAS Volume Manager,
the file system queries the Volume Manager to find out the geometry of
the underlying volume and automatically sets the I/O parameters. The
Volume Manager is queried by mkfs
when the file system is
created to automatically align the file system to the volume geometry.
Then the mount
command queries the Volume Manager when the
file system is mounted and downloads the I/O parameters.
If the default parameters are not acceptable
or the file system is being used without the Volume Manager, then the
/etc/vx/tunefstab
file can be used to set values for I/O
parameters. The mount
command reads the
/etc/vx/tunefstab
file and downloads any parameters
specified for a file system. The tunefstab
file overrides
any values obtained from the Volume Manager. While the file system is
mounted, any I/O parameters can be changed using the
vxtunefs
command which can have tuneables specified on the
command line or can read them from the /etc/vx/tunefstab
file.
For more details, see the
vxtunefs(1M)
and
tunefstab(4)
manual pages.
The vxtunefs
command can be used to print the current values of the I/O
parameters.
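For example, parameters for a file system can be placed in /etc/vx/tunefstab and then inspected with vxtunefs. This is a sketch; the device, mount point, and values are placeholders:

/dev/dsk/c1b0t3d0s1    read_pref_io=65536,read_nstream=4

vxtunefs -p /mnt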
If the default alignment from mkfs is not acceptable, the -o align=n option can be used to override alignment information obtained from the Volume Manager.
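For example (a sketch; the alignment value, device, and size are placeholders):

mkfs -F vxfs -o align=16 /dev/rdsk/c1b0t3d0s1 1024000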
read_pref_io

The preferred read request size. The file system uses this in conjunction with the read_nstream value to determine how much data to read ahead. The default value is 64K.
read_nstream

The number of parallel read requests of size read_pref_io to have outstanding at one time. The file system uses the product of read_nstream multiplied by read_pref_io to determine its read ahead size. The default value for read_nstream is 1.
read_unit_io
write_pref_io

The preferred write request size. The file system uses this in conjunction with the write_nstream value to determine how to do flush behind on writes. The default value is 64K.
write_nstream

The number of parallel write requests of size write_pref_io to have outstanding at one time. The file system uses the product of write_nstream multiplied by write_pref_io to determine when to do flush behind on writes. The default value for write_nstream is 1.
write_unit_io
pref_strength
buf_breakup_size
max_direct_iosz

The maximum size of a direct I/O request issued by the file system. If a larger I/O request comes in, it is broken up into max_direct_iosz chunks. This parameter defines how much memory an I/O request can lock at once, so it should not be set to more than 20 percent of memory.
discovered_direct_iosz

Any file I/O requests larger than discovered_direct_iosz are handled as discovered direct I/O. A discovered direct I/O is unbuffered, like direct I/O, but it does not require a synchronous commit of the inode when the file is extended or blocks are allocated. For larger I/O requests, the CPU time for copying the data into the page cache and the cost of using memory to buffer the I/O data become more expensive than the cost of doing the disk I/O. For these I/O requests, using discovered direct I/O is more efficient than regular I/O. The default value of this parameter is 256K.
default_indir_size

On VxFS, files can have up to 10 direct extents of variable size stored in the inode. Once these extents are used up, the file must use indirect extents, which are a fixed size that is set when the file first uses indirect extents. These indirect extents are 8K by default. The file system does not use larger indirect extents because it must fail a write and return ENOSPC if there are no extents available that are the indirect extent size. For file systems with many large files, the 8K indirect extent size is too small: files that get into indirect extents use a lot of smaller extents instead of a few larger ones. By using this parameter, the default indirect extent size can be increased so that large files in indirects use fewer, larger extents.

The default_indir_size parameter should be used carefully. If it is set too large, then writes will fail when they are unable to allocate extents of the indirect extent size to a file.
max_diskq

Limits the maximum disk queue generated by a single file. When the file system is flushing data for a file and the number of pages being flushed exceeds max_diskq, processes will block until the amount of data being flushed decreases. Although this does not limit the actual disk queue, it prevents flushing processes from making the system unresponsive. The default value is 1 MB.
max_extent_size

Increases or decreases the maximum size of an extent. When the file system is following its default allocation policy for sequential writes to a file, it allocates an initial extent that is large enough for the first write to the file. When additional extents are allocated, they are progressively larger (the algorithm tries to double the size of the file with each new extent), so each extent can hold several writes' worth of data. This is done to reduce the total number of extents in anticipation of continued sequential writes. When the file stops being written, any unused space is freed for other files to use. max_extent_size is measured in file system blocks.
def_init_extent

Changes the default initial extent size. VxFS determines, based on the first write to a new file, the size of the first extent to be allocated to the file. Normally, the first extent is the smallest power of 2 that is larger than the size of the first write. If that power of 2 is less than 8K, the first extent allocated is 8K. After the initial extent, the file system increases the size of subsequent extents (see max_extent_size) with each allocation. def_init_extent can change the default initial extent size to be larger, so the doubling policy will start from a much larger initial size and the file system will not allocate a set of small extents at the start of a file.
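As a sketch, a file system dominated by large sequential files might raise these values with vxtunefs; the mount point is a placeholder, and the values (in file system blocks) are illustrative assumptions, not recommendations:

vxtunefs -o def_init_extent=2048 /mnt
vxtunefs -o max_extent_size=16384 /mnt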
If the file system is being used with a hardware disk array or a volume manager other than VxVM, try to align the parameters to match the geometry of the logical disk. With striping or RAID-5, it is common to set read_pref_io to the stripe unit size and read_nstream to the number of columns in the stripe. For striped arrays, use the same values for write_pref_io and write_nstream, but for RAID-5 arrays, set write_pref_io to the full stripe size and write_nstream to 1.
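For example, for a striped array with a 64K stripe unit and four columns, and for a RAID-5 array whose full stripe (stripe unit times the number of data columns) comes to 256K, the tunefstab entries might read as follows (a sketch; the device names and geometry are assumptions):

/dev/dsk/c1b0t3d0s1    read_pref_io=65536,read_nstream=4,write_pref_io=65536,write_nstream=4
/dev/dsk/c1b0t4d0s1    read_pref_io=65536,read_nstream=4,write_pref_io=262144,write_nstream=1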
For an application to do efficient disk I/O, it should issue read requests that are equal to the product of read_nstream multiplied by read_pref_io. Generally, any multiple or factor of read_nstream multiplied by read_pref_io should be a good size for performance. For writing, the same rule of thumb applies to the write_pref_io and write_nstream parameters.
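For instance, against the striped configuration sketched above (read_pref_io of 64K and read_nstream of 4, a 256K product), sequential reads issued in 256K requests would match the product; the file name here is a placeholder:

dd if=/mnt/datafile of=/dev/null bs=256k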
When tuning a file system, the best thing to do is try out the tuning
parameters under a real life workload.
If an application is doing sequential I/O to large files, it should try
to issue requests larger than the discovered_direct_iosz
.
This causes the I/O requests to be performed as discovered direct I/O
requests, which are unbuffered like direct I/O but do not require
synchronous inode updates when extending the file. If the file is
larger than can fit in the cache, then using unbuffered I/O avoids
throwing useful data out of the cache and it avoids a lot of CPU
overhead.
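As a sketch, with the default discovered_direct_iosz of 256K, a sequential scan of a large file could use a larger request size to get discovered direct I/O behavior (the file name is a placeholder):

dd if=/mnt/bigfile of=/dev/null bs=512k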