The following topics are covered in this chapter:

    Choosing a Block Size
    Choosing an Intent Log Size
    Choosing Mount Options
    Kernel Tuneables
    Monitoring Free Space
    Tuning I/O
Choosing a Block Size

The ufs file system defaults to a 4K block size with a 1K fragment size. This means that space is allocated to small files (up to 4K) in 1K increments. Allocations for larger files are done in whole 4K blocks, except for the last block, which may be a fragment. Since many files are small, the fragment facility saves a large amount of space compared to allocating space 4K at a time.

The unit of allocation in VxFS is a block. There are no fragments, since storage is allocated in extents that consist of one or more blocks. For the most efficient space utilization, the smallest block size available on the system should be used. Typically, this provides the best performance as well. The smallest block size available is 1K, which is also the default block size for VxFS file systems created on the system. Unless there are special concerns, there should never be a need to specify a block size when creating file systems.
For large file systems with relatively few files, the system administrator may wish to experiment with larger block sizes. Specifying a larger block size decreases the amount of space used to hold the free extent bitmaps for each allocation unit, increases the maximum extent size, and decreases the number of extents used per file; the trade-off is an increase in the amount of space wasted at the end of files that are not a multiple of the block size. In short, larger block sizes use less disk space in file system overhead, but consume more space for files that are not a multiple of the block size.
Overall file system performance may be improved or degraded by changing the block size. For most applications, it is recommended that the default values for the system be used. However, certain applications may benefit from a larger block size. The easiest way to judge which block sizes will provide the greatest system efficiency is to try representative system loads against various sizes and pick the fastest.
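If a nondefault block size is desired, it can be specified when the file system is created. The following is a sketch only; the device name and size in sectors are placeholders:

mkfs -F vxfs -o bsize=4096 /dev/rdsk/c1b0t3d0s1 1024000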
Choosing an Intent Log Size

The mkfs utility uses a default intent log size of 1024 blocks. The default size is sufficient for most workloads. If the system is used as an NFS server or for intensive synchronous write workloads, performance may be improved using a larger log size.

There are several system performance benchmark suites for which VxFS performs better with larger log sizes. The best way to pick the log size is to try representative system loads against various sizes and pick the fastest.
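A larger log can likewise be chosen at file system creation time. In this sketch the log size is given in file system blocks, and the device and size are placeholders:

mkfs -F vxfs -o logsize=2048 /dev/rdsk/c1b0t3d0s1 1024000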
Choosing Mount Options

In addition to the default mount mode (log mode), VxFS provides the blkclear, delaylog, tmplog, nolog, and nodatainlog modes of operation. Caching behavior can be altered with the mincache option, and the behavior of O_SYNC writes (see the fcntl(2) manual page) can be altered with the convosync option.
The delaylog, tmplog, and nolog modes are capable of significantly improving performance. The improvement over log mode is typically about 15 to 20 percent with delaylog; with tmplog and nolog, the improvement is even higher. Performance improvement varies, depending on the operations being performed and the workload. Read/write intensive loads should show less improvement, while file system structure intensive loads (such as mkdir, create, and rename) may show over 100 percent improvement. The best way to select a mode is to test representative system loads against the logging modes and compare the performance results.
Most of the modes can be used in combination. For example, a desktop machine might use both the blkclear and mincache=closesync modes.
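Such a desktop configuration might be mounted as follows (a sketch; the device and mount point are placeholders):

mount -F vxfs -o blkclear,mincache=closesync /dev/dsk/c1b0t3d0s1 /mnt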
Additional information on mount options can be found in the mount_vxfs(1M) manual page.
log

With log mode, VxFS guarantees that all structural changes to the file system have been logged on disk when the system call returns. If a system failure occurs, fsck replays recent changes so that they will not be lost.
delaylog

In delaylog mode, some system calls return before the intent log is written. This logging delay improves the performance of the system, but some changes are not guaranteed until a short time after the system call returns, when the intent log is written. If a system failure occurs, recent changes may be lost. This mode approximates traditional UNIX guarantees for correctness in case of system failures. Fast file system recovery works with this mode.
tmplog

In tmplog mode, intent logging is almost always delayed. This greatly improves performance, but recent changes may disappear if the system crashes. This mode is only recommended for temporary file systems. Fast file system recovery works with this mode.
nolog

In nolog mode, the intent log is disabled. All I/O requests, including synchronous I/O, are performed asynchronously. Unlike the other logging modes, nolog does not provide fast file system recovery. With nolog mode, a full structural check must be performed after a crash; this may result in loss of substantial portions of the file system, depending upon activity at the time of the crash. Usually, a nolog file system should be rebuilt with mkfs after a crash. The nolog mode should only be used for memory resident or very temporary file systems.
Use extreme caution with the nolog option, as it can cause files to be lost even if those files were written synchronously. With nolog, the potential loss after a crash is greater than with ufs.
nodatainlog

The nodatainlog mode should be used on systems with disks that do not support bad block revectoring.
Normally, a VxFS file system uses the intent log for synchronous
writes. The inode update and the data are both
logged in the transaction, so a synchronous write only requires one
disk write instead of two. When the synchronous write returns to the
application, the file system has told the application that the data is
already written. If a disk error causes the data update to fail, then
the file must be marked bad and the entire file is lost.
If a disk supports bad block revectoring, then a failure on the data
update is unlikely, so logging synchronous writes should be allowed. If
the disk does not support bad block revectoring, then a failure is more
likely, so the nodatainlog
mode should be used.
A nodatainlog
mode file system should be approximately 50
percent slower than a standard mode VxFS file system for synchronous
writes. Other operations are not affected.
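Like the other logging modes, nodatainlog is selected at mount time (a sketch; the device and mount point are placeholders):

mount -F vxfs -o nodatainlog /dev/dsk/c1b0t3d0s1 /mnt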
blkclear

The blkclear mode is used in increased data security environments. The blkclear mode guarantees that uninitialized storage never appears in files. The increased integrity is provided by clearing extents on disk when they are allocated within a file. Extending writes are not affected by this mode. A blkclear mode file system should be approximately 10 percent slower than a standard mode VxFS file system, depending on the workload.
mincache

The mincache mode has five suboptions:
mincache=closesync
mincache=direct
mincache=dsync
mincache=unbuffered
mincache=tmpcache
The mincache=closesync mode is useful in desktop environments where users are likely to shut off the power on the machine without halting it first. In this mode, any changes to the file are flushed to disk when the file is closed.
To improve performance, most file systems do not synchronously update
data and inode changes to disk. If the system crashes, files that have
been updated within the past minute are in danger of losing data. With
the mincache=closesync
mode, if the system crashes or is
switched off, only files that are currently open can lose data. A
mincache=closesync
mode file system should be
approximately 15 percent slower than a standard mode VxFS file system,
depending on the workload.
The mincache=direct, mincache=unbuffered, and mincache=dsync modes are used in environments where applications are experiencing reliability problems caused by the kernel buffering of I/O and delayed flushing of non-synchronous I/O. The mincache=direct and mincache=unbuffered modes guarantee that all non-synchronous I/O requests to files will be handled as if the VX_DIRECT or VX_UNBUFFERED caching advisories had been specified. The mincache=dsync mode guarantees that all non-synchronous I/O requests to files will be handled as if the VX_DSYNC caching advisory had been specified. Refer to the vxfsio(7) manual page for explanations of VX_DIRECT, VX_UNBUFFERED, and VX_DSYNC. The mincache=direct, mincache=unbuffered, and mincache=dsync modes also flush file data on close, as mincache=closesync does.
Since the mincache=direct, mincache=unbuffered, and mincache=dsync modes change non-synchronous I/O to synchronous I/O, there can be a substantial degradation in throughput for small to medium size files for most applications. Since the VX_DIRECT and VX_UNBUFFERED advisories do not allow any caching of data, applications that would normally benefit from caching for reads will usually experience less degradation with the mincache=dsync mode. The mincache=direct and mincache=unbuffered modes require significantly less CPU time than buffered I/O.
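A file system supporting such applications might therefore be mounted as follows (a sketch; the device and mount point are placeholders):

mount -F vxfs -o mincache=dsync /dev/dsk/c1b0t3d0s1 /mnt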
If performance is more important than data integrity, the
mincache=tmpcache
mode may be used. The
mincache=tmpcache
mode disables special delayed extending
write handling, trading off less integrity for better performance.
Unlike the other mincache
modes, tmpcache
does not flush the file to disk when it is closed. When this option is
used, garbage may appear in a file that was being extended when a crash
occurred.
convosync

The convosync (convert O_SYNC) mode has five suboptions: convosync=closesync, convosync=direct, convosync=dsync, convosync=unbuffered, and convosync=delay.

Note that the convosync=dsync option violates POSIX guarantees for synchronous I/O.
The convosync=closesync
mode converts synchronous and data
synchronous writes to non-synchronous writes and flushes the changes to
the file to disk when the file is closed.
The convosync=delay
mode causes synchronous and data
synchronous writes to be delayed rather than to take effect
immediately. No special action is performed when closing a file. This
option effectively cancels any data integrity guarantees normally
provided by opening a file with O_SYNC
.
See the
open(2),
fcntl(2),
and
vxfsio(7)
manual pages for more information on O_SYNC
.
Use extreme caution when using the convosync=closesync or convosync=delay modes, as they actually change synchronous I/O into non-synchronous I/O. This may cause applications that use synchronous I/O for data reliability to fail if the system crashes and synchronously written data is lost.
The convosync=direct and convosync=unbuffered modes convert synchronous and data synchronous reads and writes to direct reads and writes.
The convosync=dsync
mode converts synchronous writes to
data synchronous writes.
As with closesync, the direct, unbuffered, and dsync modes flush changes to the file to disk when it is closed. These modes can be used to speed up applications that use synchronous I/O. Many applications that are concerned with data integrity specify the O_SYNC flag in order to write the file data synchronously. However, this has the undesirable side effect of updating inode times and therefore slowing down performance. The convosync=dsync, convosync=unbuffered, and convosync=direct modes alleviate this problem by allowing applications to take advantage of synchronous writes without modifying inode times as well.
Before using convosync=dsync, convosync=unbuffered, or convosync=direct, make sure that all applications that use the file system do not require synchronous inode time updates for O_SYNC writes.
Combining mount Options

Although mount options can be combined arbitrarily, some combinations do not make sense. For example, the mount option combination:
mount -F vxfs -o nolog,blkclear
disables intent logging so that a full fsck
is required
after a system failure, yet it incurs the performance overhead of
clearing each extent for a sparse file before allocating the extent to
the file. Without intent logging, there is no guarantee of data
integrity after a crash, so the clearing of extents allocated to sparse
files is a waste of time.
The following examples provide some common and reasonable
mount
option combinations.
mount -F vxfs -o log,mincache=closesync /dev/dsk/c1b0t3d0s1 /mnt
This guarantees that when a file is closed, its data is synchronized to disk and cannot be lost. Thus, once an application exits and its files are closed, no data will be lost even if the system is immediately turned off.

mount -F vxfs -o tmplog,convosync=delay,mincache=tmpcache \
/dev/dsk/c1b0t3d0s1 /mnt

This combination might be used for a temporary file system where performance is more important than absolute data integrity. Any O_SYNC writes are performed as delayed writes, and delayed extending writes are not handled specially (which could result in a file that contains garbage if the system crashes at the wrong time). Any file written 30 seconds or so before a crash may contain garbage or be missing if this mount combination is in effect. However, such a file system does significantly fewer disk writes than a log file system, and should have significantly better performance, depending on the application.
mount -F vxfs -o log,convosync=dsync /dev/dsk/c1b0t3d0s1 /mnt
This combination would be used to improve the performance of applications that perform O_SYNC writes, but only require data synchronous write semantics. Their performance can be significantly improved if the file system is mounted using convosync=dsync without any loss of data integrity.

Kernel Tuneables

There are several kernel tuneables that affect the performance of the vxfs file system. These tuneables are discussed in the following sections.
The kernel tuneables are found in files located in /etc/conf/mtune.d. The mtune.d files contain the system default, minimum, and maximum values for each tuneable. The /etc/conf/cf.d/stune file contains entries for tuneables that cannot use the default value in the mtune.d file. The stune file is used to change tuneable values to something other than the default, when necessary.
For the s5, sfs, ufs, and vxfs file system types, inodes are cached in a "per file system table," known as the inode table. Each file system type has a tuneable to determine the maximum number of entries in its inode table. For the s5 file system, the tuneable is NINODE. For ufs and sfs, the tuneable is SFSNINODE.
The vxfs
file system type (FSType) uses the tuneable
VXFSNINODE
as the maximum number of entries in the
vxfs
inode table. The actual size
of the inode table is dynamically adjusted as the system activity
changes. The VXFSNINODE
parameter is the upper bound on
the size of the inode table.
The value of VXFSNINODE
is determined automatically at
boot time, based on the amount of memory in the system. The value for
VXFSNINODE
should generally be left alone and not altered.
The current size of the table and number of inodes in use can be
monitored using sar
with the -t
option.
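For example, the following invocation samples the table five times at ten-second intervals (a sketch; the interval and count are arbitrary):

sar -t 10 5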
If the VXFSNINODE
value is too small, the system may run
out of inode-table entries, and system calls may fail with
ENFILE
. If the value is too large, excessive memory may be
consumed by inode-table entries, which would adversely affect system
performance. Before overriding the automatically determined value by
specifying a positive value for this tuneable, the system administrator
should verify that the problem prompting this change is not due to an
application that forgets to close files.
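If an override does prove necessary, the usual SVR4 mechanism applies: place an entry in /etc/conf/cf.d/stune (typically with the idtune utility), rebuild the kernel, and reboot. A sketch, with an illustrative value only:

/etc/conf/bin/idtune VXFSNINODE 16384
/etc/conf/bin/idbuild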
Monitoring Free Space

Using the df command to monitor free space is desirable. Full file systems may have an adverse effect on file system performance. Full file systems should therefore have some files removed, or should be expanded (see the fsadm_vxfs(1M) manual page for a description of online file system expansion).

Fragmentation reduces performance and availability. Regular use of fsadm's fragmentation reporting and reorganization facilities is therefore advisable.
The easiest way to ensure that fragmentation does not become a problem is to schedule regular defragmentation runs from cron. Defragmentation scheduling should range from weekly (for frequently used file systems) to monthly (for infrequently used file systems). Extent fragmentation should be monitored with fsadm or the -o s option of df. There are three factors which can be used to determine the degree of fragmentation:

    the percentage of free space in extents of less than 8 blocks in length
    the percentage of free space in extents of less than 64 blocks in length
    the percentage of free space in extents of 64 or more blocks in length

The optimal period between defragmentation runs can be determined by choosing a reasonable initial interval, scheduling fsadm runs at that interval, and running the extent fragmentation report feature of fsadm before and after the reorganization.
The "before" result is the degree of fragmentation prior to the
reorganization. If the degree of fragmentation is approaching the
figures for bad fragmentation, then the interval between
fsadm
runs should be reduced. If the degree of
fragmentation is low, the interval between fsadm
runs can
be increased.
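The current degree of fragmentation can be checked with either interface (a sketch; the mount point is a placeholder):

df -F vxfs -o s /home
/etc/fs/vxfs/fsadm -E /home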
The "after" result is an indication of how well the reorganizer is performing. If the degree of fragmentation is not close to the characteristics of an unfragmented file system, then the extent reorganizer is not functioning properly. The file system may be a candidate for expansion. (Full file systems tend to fragment and are difficult to defragment.) It is also possible that the reorganization is not being performed at a time during which the file system in question is relatively idle.
Directory reorganization is not nearly as critical as extent
reorganization, but
regular directory reorganization will improve performance. It is
advisable to schedule directory reorganization for file systems when
the extent reorganization is scheduled. The following is a sample
script that is run periodically at 3:00 A.M.
from cron
for a number of file systems:
outfile=/usr/spool/fsadm/out.`/bin/date +'%m%d'`
for i in /home /home2 /project /db
do
    /bin/echo "Reorganizing $i"
    /bin/timex /etc/fs/vxfs/fsadm -e -E -s $i
    /bin/timex /etc/fs/vxfs/fsadm -s -d -D $i
done > $outfile 2>&1
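Such a script might be invoked weekly from a crontab entry along these lines (a sketch; the script path is a placeholder):

0 3 * * 6 /usr/local/bin/fsadm_reorg.sh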
Tuning I/O

If the VxFS file system is being used with the VERITAS Volume Manager,
the file system queries the Volume Manager to find out the geometry of
the underlying volume and automatically sets the I/O parameters. The
Volume Manager is queried by mkfs
when the file system is
created to automatically align the file system to the volume geometry.
Then the mount
command queries the Volume Manager when the
file system is mounted and downloads the I/O parameters.
If the default parameters are not acceptable
or the file system is being used without the Volume Manager, then the
/etc/vx/tunefstab
file can be used to set values for I/O
parameters. The mount
command reads the
/etc/vx/tunefstab
file and downloads any parameters
specified for a file system. The tunefstab
file overrides
any values obtained from the Volume Manager. While the file system is
mounted, any I/O parameters can be changed using the
vxtunefs
command which can have tuneables specified on the
command line or can read them from the /etc/vx/tunefstab
file.
For more details, see the
vxtunefs(1M)
and
tunefstab(4)
manual pages.
The vxtunefs
command can be used to print the current values of the I/O
parameters.
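For example, parameters for a file system can be placed in /etc/vx/tunefstab and then inspected with vxtunefs. This is a sketch; the device, mount point, and values are placeholders:

/dev/dsk/c1b0t3d0s1    read_pref_io=65536,read_nstream=4

vxtunefs -p /mnt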
If the default alignment from mkfs is not acceptable, the -o align=n option can be used to override alignment information obtained from the Volume Manager.
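For example (a sketch; the alignment value, device, and size are placeholders):

mkfs -F vxfs -o align=16 /dev/rdsk/c1b0t3d0s1 1024000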
read_pref_io

The preferred read request size. The file system uses this in conjunction with the read_nstream value to determine how much data to read ahead. The default value is 64K.
read_nstream

The number of parallel read requests of size read_pref_io to have outstanding at one time. The file system uses the product of read_nstream multiplied by read_pref_io to determine its read ahead size. The default value for read_nstream is 1.
read_unit_io
write_pref_io

The preferred write request size. The file system uses this in conjunction with the write_nstream value to determine how to do flush behind on writes. The default value is 64K.
write_nstream

The number of parallel write requests of size write_pref_io to have outstanding at one time. The file system uses the product of write_nstream multiplied by write_pref_io to determine when to do flush behind on writes. The default value for write_nstream is 1.
write_unit_io
pref_strength
buf_breakup_size
max_direct_iosz

The maximum size of a direct I/O request issued by the file system. If a larger I/O request comes in, it is broken up into max_direct_iosz chunks. This parameter defines how much memory an I/O request can lock at once, so it should not be set to more than 20 percent of memory.
discovered_direct_iosz

Any file I/O requests larger than discovered_direct_iosz are handled as discovered direct I/O. A discovered direct I/O is unbuffered, like direct I/O, but it does not require a synchronous commit of the inode when the file is extended or blocks are allocated. For larger I/O requests, the CPU time for copying the data into the page cache and the cost of using memory to buffer the I/O data become more expensive than the cost of doing the disk I/O. For these I/O requests, using discovered direct I/O is more efficient than regular I/O. The default value of this parameter is 256K.
default_indir_size

On VxFS, files can have up to 10 direct extents of variable size stored in the inode. Once these extents are used up, the file must use indirect extents, which are a fixed size that is set when the file first uses indirect extents. These indirect extents are 8K by default. The file system does not use larger indirect extents because it must fail a write and return ENOSPC if there are no extents available that are the indirect extent size. For file systems with many large files, the 8K indirect extent size is too small: files that get into indirect extents use a lot of smaller extents instead of a few larger ones. By using this parameter, the default indirect extent size can be increased so that large files in indirects use fewer, larger extents.

The default_indir_size parameter should be used carefully. If it is set too large, then writes will fail when they are unable to allocate extents of the indirect extent size to a file.
max_diskq

Limits the maximum disk queue generated by a single file. When the file system is flushing data for a file and the number of pages being flushed exceeds max_diskq, processes will block until the amount of data being flushed decreases. Although this does not limit the actual disk queue, it prevents flushing processes from making the system unresponsive. The default value is 1 MB.
max_extent_size

Increases or decreases the maximum size of an extent. When the file system is following its default allocation policy for sequential writes to a file, it allocates an initial extent that is large enough for the first write to the file. When additional extents are allocated, they are progressively larger (the algorithm tries to double the size of the file with each new extent), so each extent can hold several writes' worth of data. This is done to reduce the total number of extents in anticipation of continued sequential writes. When the file stops being written, any unused space is freed for other files to use. max_extent_size is measured in file system blocks.
def_init_extent

Changes the default initial extent size. VxFS determines, based on the first write to a new file, the size of the first extent to be allocated to the file. Normally, the first extent is the smallest power of 2 that is larger than the size of the first write. If that power of 2 is less than 8K, the first extent allocated is 8K. After the initial extent, the file system increases the size of subsequent extents (see max_extent_size) with each allocation. def_init_extent can change the default initial extent size to be larger, so the doubling policy will start from a much larger initial size and the file system will not allocate a set of small extents at the start of a file.
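As a sketch, a file system dominated by large sequential files might raise these values with vxtunefs; the mount point is a placeholder, and the values (in file system blocks) are illustrative assumptions, not recommendations:

vxtunefs -o def_init_extent=2048 /mnt
vxtunefs -o max_extent_size=16384 /mnt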
If the file system is being used with a hardware disk array or a volume manager other than VxVM, try to align the parameters to match the geometry of the logical disk. With striping or RAID-5, it is common to set read_pref_io to the stripe unit size and read_nstream to the number of columns in the stripe. For striped arrays, use the same values for write_pref_io and write_nstream, but for RAID-5 arrays, set write_pref_io to the full stripe size and write_nstream to 1.
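For example, for a striped array with a 64K stripe unit and four columns, and for a RAID-5 array whose full stripe (stripe unit times the number of data columns) comes to 256K, the tunefstab entries might read as follows (a sketch; the device names and geometry are assumptions):

/dev/dsk/c1b0t3d0s1    read_pref_io=65536,read_nstream=4,write_pref_io=65536,write_nstream=4
/dev/dsk/c1b0t4d0s1    read_pref_io=65536,read_nstream=4,write_pref_io=262144,write_nstream=1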
For an application to do efficient disk I/O, it should issue read requests that are equal to the product of read_nstream multiplied by read_pref_io. Generally, any multiple or factor of read_nstream multiplied by read_pref_io should be a good size for performance. For writing, the same rule of thumb applies to the write_pref_io and write_nstream parameters.
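For instance, against the striped configuration sketched above (read_pref_io of 64K and read_nstream of 4, a 256K product), sequential reads issued in 256K requests would match the product; the file name here is a placeholder:

dd if=/mnt/datafile of=/dev/null bs=256k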
When tuning a file system, the best thing to do is try out the tuning
parameters under a real life workload.
If an application is doing sequential I/O to large files, it should try
to issue requests larger than the discovered_direct_iosz
.
This causes the I/O requests to be performed as discovered direct I/O
requests, which are unbuffered like direct I/O but do not require
synchronous inode updates when extending the file. If the file is
larger than can fit in the cache, then using unbuffered I/O avoids
throwing useful data out of the cache and it avoids a lot of CPU
overhead.
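As a sketch, with the default discovered_direct_iosz of 256K, a sequential scan of a large file could use a larger request size to get discovered direct I/O behavior (the file name is a placeholder):

dd if=/mnt/bigfile of=/dev/null bs=512k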