vxrelocd(ADM)

vxrelocd - monitor the Volume Manager for failure events and relocate failed subdisks

Synopsis

/etc/vx/bin/vxrelocd [ -o vxrecover_argument ] [ mail-address ... ]

Description

The vxrelocd command monitors the Volume Manager by analyzing the output of the vxnotify(ADM) command, waiting for failures to occur. When a failure occurs, it sends mail via mailx(C) to root (by default) or to other specified users, and then attempts to relocate the failed subdisks. Once a relocation attempt is complete, vxrelocd sends further mail indicating the status of each subdisk replacement. vxrecover(ADM) is then run on volumes with relocated subdisks to restore the data.

Options

The -o option may be specified to vxrelocd with an argument. The -o option and its argument will be passed directly to vxrecover if vxrecover is run. This allows the administrator to specify -o slow[=iodelay] to keep vxrecover from overloading a busy system during recovery. The default value for the delay is 250 milliseconds.
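
As an illustration, the following shell sketch assembles a vxrelocd command line that passes -o slow=iodelay through to vxrecover. The function name build_vxrelocd_cmd, the 500-millisecond delay, and the recipient list are hypothetical; the sketch only prints the command line and does not start the daemon.

```shell
# Sketch only: print (do not run) a vxrelocd command line that throttles
# recovery. build_vxrelocd_cmd is a hypothetical helper, not part of VxVM.
build_vxrelocd_cmd() {
    iodelay_ms=$1          # delay passed through to vxrecover as slow=iodelay
    shift                  # remaining arguments are mail addresses
    printf '/etc/vx/bin/vxrelocd -o slow=%s' "$iodelay_ms"
    for addr in "$@"; do
        printf ' %s' "$addr"
    done
    printf '\n'
}

# Example values (assumed): a 500 ms delay, mail to root and admin.
build_vxrelocd_cmd 500 root admin
```

A delay larger than the 250-millisecond default slows recovery further in exchange for less I/O load on the system.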

Mail notification

By default, vxrelocd sends electronic mail to root with information about a detected failure and the status of any relocation and recovery attempts. To instruct vxrelocd to send this mail to other users, add the desired user logins to the vxrelocd startup line in the startup script /etc/rc2.d/s95vxvm-recover and then reboot the system. Alternatively, you can kill the vxrelocd process and restart it as vxrelocd root mail-address, where mail-address is a user's login. Do not kill the vxrelocd process while a relocation attempt is in progress.

The mail notification that is sent when a failure is detected follows this format:

	Failures have been detected by the VERITAS Volume Manager:

        failed disks:
        medianame
          ...
        failed plexes:
        plexname
          ...
        failed log plexes:
        plexname
          ...
        failing disks:
        medianame
          ...
        failed subdisks:
        subdiskname
          ...

	The Volume Manager will attempt to find spare disks, relocate failed
	subdisks and then recover the data in the failed plexes.

The medianame list under failed disks: specifies disks that appear to have completely failed; the medianame list under failing disks: indicates a partial disk failure or a disk that is in the process of failing. When a disk has failed completely, the same medianame list appears under both failed disks: and failing disks:. The plexname list under failed plexes: shows plexes that have been detached because of I/O failures on subdisks they contain. The plexname list under failed log plexes: indicates RAID-5 or DRL log plexes that have experienced failures. The subdiskname list under failed subdisks: specifies subdisks in RAID-5 volumes that have been detached due to I/O errors.
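
A saved notification can be broken into its lists with a short filter. The sketch below assumes the mail body matches the layout shown above exactly (section headings ending in a colon, one name per line, "..." as a placeholder, and a blank line ending the lists); parse_failure_mail is a hypothetical helper name, not a Volume Manager utility.

```shell
# Sketch only: read a failure notification on standard input and print one
# "section<TAB>name" line per listed object. Assumes the layout shown above.
parse_failure_mail() {
    awk '
        # Work on a copy of the line with surrounding whitespace removed.
        { line = $0; gsub(/^[ \t]+|[ \t]+$/, "", line) }
        # A blank line ends whatever list we were in.
        line == "" { section = ""; next }
        # One of the five known headings starts a new list.
        line ~ /^(failed disks|failed plexes|failed log plexes|failing disks|failed subdisks):$/ {
            section = substr(line, 1, length(line) - 1)
            next
        }
        # Any other line inside a list is an object name.
        section != "" && line != "..." { print section "\t" line }
    '
}

# Usage (file name is hypothetical):
#   parse_failure_mail < saved-notification.txt
```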

Spare space

A disk can be marked as "spare." This makes the disk available as a site for relocating failed subdisks. Disks that are marked as spares are not used for normal allocations unless they are explicitly specified by the administrator. This ensures that there is a pool of spare space available for relocating failed subdisks and that this space won't get consumed by normal operations. Spare space is the first space used to relocate failed subdisks. However, if no spare space is available or the available spare space is not suitable or sufficient, free space will also be used. See the vxedit(ADM) and vxdiskadm(ADM) manual pages for more information on marking a disk as a spare.

Replacement procedure

Once mail has been sent, vxrelocd attempts to relocate any subdisks that appear to have failed (that is, those listed under failed subdisks: in the notification). This involves finding appropriate spare and/or free space in the same disk group as the failed subdisk. A disk is eligible as replacement space if it is a valid Volume Manager disk and contains enough space to hold the data contained in the failed subdisk. If no space is available on spare disks, the relocation will be attempted using free space instead.

To determine which disk from among the eligible spares should be used, vxrelocd tries to use the disk that is "closest" to the failed disk. The value of "closeness" depends on the controller, target, and disk number of the failed disk. A disk on the same controller as the failed disk is closer than a disk on a different controller; a disk under the same target as the failed disk is closer than one under a different target.
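
The ordering described above can be sketched as a small ranking function. This is an illustration, not the algorithm vxrelocd actually uses: the controller:target:disk notation, the three-level score, and the function names closeness and pick_closest are all assumptions made for the example.

```shell
# Sketch only: rank candidate disks by "closeness" to a failed disk.
# Disks are written as controller:target:disk (hypothetical notation).

# closeness FAILED CANDIDATE
# Prints 0 (same controller and target), 1 (same controller only),
# or 2 (different controller). Lower means closer.
closeness() {
    f_ctlr=${1%%:*}; f_rest=${1#*:}; f_tgt=${f_rest%%:*}
    c_ctlr=${2%%:*}; c_rest=${2#*:}; c_tgt=${c_rest%%:*}
    if [ "$f_ctlr" != "$c_ctlr" ]; then
        echo 2
    elif [ "$f_tgt" != "$c_tgt" ]; then
        echo 1
    else
        echo 0
    fi
}

# pick_closest FAILED CANDIDATE...
# Prints the candidate with the lowest score; the first one wins ties.
pick_closest() {
    failed=$1
    shift
    best=""
    best_score=3
    for cand in "$@"; do
        score=$(closeness "$failed" "$cand")
        if [ "$score" -lt "$best_score" ]; then
            best=$cand
            best_score=$score
        fi
    done
    echo "$best"
}

# Example: the failed disk is 0:1:0; 0:1:1 shares its controller and target.
pick_closest 0:1:0 1:1:0 0:2:0 0:1:1
```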

If no spare or free space is found, mail will be sent explaining the disposition of volumes that had storage on the failed disk:

	Hot-relocation was not successful for subdisks on disk dm_name in
	volume v_name in disk group dg_name.  No replacement was made and
	the disk is still unusable.

	The following volumes have storage on medianame:

	volumename
	...

	These volumes are still usable, but the redundancy of
	those volumes is reduced. Any RAID-5 volumes with storage 
	on the failed disk may become unusable in the face of further 
	failures.

If any non-RAID-5 volumes were made unusable due to the failure of the disk, the following message is included:

	The following volumes:

	volumename
	...

	have data on medianame but have no other usable 
	mirrors on other disks. These volumes are now unusable 
	and the data on them is unavailable.  These volumes must
	have their data restored.

If any RAID-5 volumes were made unavailable due to the disk failure, the following message is included:

	The following RAID-5 volumes:

	volumename
	...

	had storage on medianame and have experienced 
	other failures. These RAID-5 volumes are now unusable 
	and data on them is unavailable.  These RAID-5 volumes must
	have their data restored.

If spare or free space is found, subdisk relocations are attempted. This involves setting up a subdisk on the spare or free space and using it to replace the failed subdisk. If this is successful, the vxrecover(ADM) command is run in the background to recover the contents of volumes that had storage on the disk.

If the relocation fails, the following message is sent:

	Hot-relocation was not successful for subdisks on disk dm_name in
	volume v_name in disk group dg_name.  No replacement was made
	and the disk is still unusable.

	error message

If any volumes (RAID-5 or otherwise) are rendered unusable due to the failure, the following message is included:

	The following volumes:

	volumename
	...

	have data on dm_name but have no other usable mirrors on other
	disks. These volumes are now unusable and the data on them is
	unavailable. These volumes must have their data restored.

If the relocation procedure completed successfully and recovery is under way, the following mail message is sent:

	Volume v_name Subdisk sd_name relocated to newsd_name,
	but not yet recovered.

Once recovery has completed, a mail message will be sent relaying the outcome of the recovery procedure. If the recovery was successful, the following message is included in the mail:

	Recovery complete for volume v_name in disk group dg_name.

If the recovery was not successful, the following message is included in the mail:

	Failure recovering v_name in disk group dg_name.

Disabling vxrelocd

If you do not want automatic subdisk relocation to occur in the event of a failure, you can disable the hot-relocation feature by killing the relocation daemon, vxrelocd, and preventing it from restarting. However, you should not kill the daemon while it is attempting relocation. To kill the daemon, run the command ps -ef and find the two entries for vxrelocd. Execute the command kill -9 PID1 PID2 (replacing PID1 and PID2 with the process IDs of the two vxrelocd processes). To prevent vxrelocd from being started again, you must comment out the line that starts up vxrelocd in the startup script /etc/rc2.d/s95vxvm-recover.
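
The process IDs can be picked out of the ps -ef listing with a small filter, sketched below. It assumes the usual ps -ef layout (a header line, with the PID in the second column); vxrelocd_pids is a hypothetical helper name. The sketch only prints the IDs and leaves the kill step to the administrator.

```shell
# Sketch only: read "ps -ef" output on standard input and print the PID
# (second column) of every line that mentions vxrelocd. The header line and
# the awk process itself are skipped.
vxrelocd_pids() {
    awk 'NR > 1 && /vxrelocd/ && !/awk/ { print $2 }'
}

# Usage:
#   ps -ef | vxrelocd_pids
# After confirming that no relocation is in progress, pass the two IDs
# printed to kill -9 as described above.
```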

Files

/etc/rc2.d/s95vxvm-recover: The startup file for vxrelocd.

References

kill(C), mailx(C), ps(C), vxdiskadm(ADM), vxedit(ADM), vxintro(ADM), vxnotify(ADM), vxrecover(ADM)