vxio(HW)
vxio - Volume Manager virtual disk devices 
 Description
There are two types of Volume Manager virtual disk devices: volume devices and plex devices.  The volume devices support a virtual disk access method with disk mirroring and disk striping.  A volume is a logical entity composed of one or more plexes.  A read can be satisfied from any plex, while a write is directed to all plexes.  The virtual disk devices have a wide variety of behaviors, which are programmable through the /dev/vx/config device.  For volume devices, both block- and character-special devices are implemented. 
Each plex in the volume is a copy of the volume address space.  The plex has subdisks associated with it.  These subdisks provide backup storage for the volume address space. 
The plex virtual disk devices are implemented only as character-special devices. When a plex is associated with a volume, the vxconfigd process creates a device node for accessing the plex.  This node has the same uid, gid, and mode as the volume device it belongs to. A plex device acts like an alternate interface into the volume device. An open on a plex device opens the associated volume device.  A read on a plex device reads data through the volume, but only from the individual plex as specified by the plex device used.  A write on a plex device writes data to the volume, but only to the individual plex.  Direct use of plex devices is primarily useful for examining plexes, such as with fsck -n.  They can also be used as part of a backup scheme, where one plex of a volume is detached and then used as a stable image of the volume for backup purposes. 
Direct plex access is also useful for recovery from unusual conditions, where it is desirable to examine each plex of a volume. 
It is possible to create a sparse plex, which is a plex without backup storage for some of the volume address space.  The areas of a sparse plex that do not have a backing subdisk are called holes.  An attempt to read a hole in a sparse plex fails.  If there are other plexes that have backup storage for the read, then one of those plexes is read.  Otherwise, the read fails.  A write to a hole in a sparse plex is considered a success even though the data can't be read back. 
In addition, a plex may be designated as a logging plex.  This means that a log of blocks that are in transition will be kept, which enables fast recovery after a system failure.  This feature is known as DRL, or dirty region logging.  The log for each plex consists of a specially designated subdisk that is not part of the normal plex address space. 
 Ioctls
The ioctl commands supported by the volume virtual disk device interface are discussed later in this section. The only ioctls supported for plex devices are GET_DAEMON and GET_VOLINFO. The format for calling each ioctl command is: 
	#include <sys/types.h>
	#include <sys/volclient.h>
	struct tag arg;
	int ioctl (int fd, int cmd, struct tag *arg);
The value of cmd is the ioctl command code, and arg is usually a pointer to a structure containing the arguments that need to be passed to the kernel. 
The return value for all these ioctls, with some exceptions, is 0 if the command was successful, and -1 if it was rejected.  If the return value is -1, then errno is set to indicate the cause of the error. 
The following ioctl commands are supported: 
- GET_DAEMON 
- This ioctl returns the pid of the process with the /dev/vx/config device open, or 0 if the /dev/vx/config device 
is closed.  The value of arg is undefined and should be NULL. 
 
- GET_VOLINFO 
- This command accepts a pointer to a volinfo structure as an argument.  It fills in the volinfo structure 
with the corresponding values from the kernel.  The members of a volinfo structure are: 
 
 
	long      version;               /* kernel version number */
	long      max_volprivmem;        /* max size of volprivmem area */
	major_t   volbmajor;             /* volume blk dev major number */
	major_t   volcmajor;             /* volume char dev major number */
	major_t   plexmajor;             /* plex device major number */
	long      maxvol;                /* max # of volumes supported */
	long      maxplex;               /* max # of associated plexes */
	long      plexnum;               /* max plexes per volume */
	long      sdnum;                 /* max subdisks per plex */
	long      max_ioctl;             /* max size of ioctl data */
	long      max_specio;            /* max size of ioctl I/O op */
	long      max_io;                /* max size of I/O operation */
	long      vol_maxkiocount;       /* max # top level I/Os allowed */
	long      dflt_iodelay;          /* default I/O delay for utils */
	long      max_parallelio;        /* max # voldios allowed */
	long      voldrl_min_regionsz;   /* min DRL region size */
	long      voldrl_max_drtregs;    /* max # of DRL dirty regions */
	long      vol_is_root;           /* if set, root is volume */
	long      mvrmaxround;           /* max round-robin region size */
	long      prom_version;          /* PROM version of the system */
	long      vol_maxstablebufsize;  /* max size of copy buffer */
	size_t    voliot_iobuf_limit;    /* max total I/O trace buf spc */
	size_t    voliot_iobuf_max;      /* max size of I/O trace buffer */
	size_t    voliot_iobuf_default;  /* default I/O trace buf size */
	size_t    voliot_errbuf_default; /* default error trace buf size */
	long      voliot_max_open;       /* max # of trace channels */
	size_t    vol_checkpt_default;   /* default checkpoint size */
	long      volraid_rsrtransmax;   /* max # of transient RSRs */
- PLEX_DETACH 
- This command is used to force a plex to be detached from a volume.  The name of the plex is passed 
in as an argument.  The volume is the volume device against which the ioctl is being 
performed. 
 
- VOL_LOG_WRITE 
- This command forces a dirty region log to be flushed to disk.  This is used by the vxconfigd process 
to flush an initial log to disk before starting the volume. 
 
- VOL_READ, VOL_WRITE 
- These commands provide a mechanism by which I/O can be issued to volumes larger than 2 
gigabytes in length.  The UNIX read and write system calls on 32-bit processors limit 
sizes to 2 gigabytes because they use a signed byte offset to perform the I/O.  These 
ioctl commands take sector offsets instead, which raises the limit 
to one sector less than 1024 gigabytes. 
 
- The required I/O is identified to the command by the use of a vol_rdwr structure containing the 
following: 
 
	ulong_t   vrw_flags;              /* flags */
	voff_t    vrw_off;                /* offset in volume (sectors) */
	size_t    vrw_size;               /* number of sectors to transfer */
	caddr_t   vrw_addr;               /* user address for transfer */
The vrw_flags field is currently unused; other fields are explained in the comments. 
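The sector arithmetic behind this limit can be sketched as follows. This is an illustration only: the 512-byte sector size, the signed 32-bit sector offset, the struct name vol_rdwr_sketch, and the helper names are assumptions made for the example and are not taken from <sys/volclient.h>.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sector size; the man page does not state it. */
#define SECTOR_SIZE 512u

/* Local stand-in for vol_rdwr. Field names follow the man page;
 * the types are simplified guesses, not the real header's types. */
struct vol_rdwr_sketch {
    unsigned long vrw_flags;  /* currently unused, set to 0 */
    long          vrw_off;    /* offset in volume (sectors) */
    size_t        vrw_size;   /* number of sectors to transfer */
    void         *vrw_addr;   /* user address for transfer */
};

/* Fill a request from a byte offset and byte length.
 * Returns 0 on success, or -1 if either value is not sector-aligned
 * or the sector offset does not fit in a signed 32-bit value. */
int fill_rdwr(struct vol_rdwr_sketch *rw, uint64_t byte_off,
              uint64_t byte_len, void *buf)
{
    uint64_t sector;

    if (byte_off % SECTOR_SIZE != 0 || byte_len % SECTOR_SIZE != 0)
        return -1;
    sector = byte_off / SECTOR_SIZE;
    if (sector > INT32_MAX)
        return -1;
    rw->vrw_flags = 0;
    rw->vrw_off = (long)sector;
    rw->vrw_size = (size_t)(byte_len / SECTOR_SIZE);
    rw->vrw_addr = buf;
    return 0;
}

/* Highest addressable byte offset under these assumptions:
 * (2^31 - 1) sectors * 512 bytes, one sector short of 1024 GB. */
uint64_t max_byte_offset(void)
{
    return (uint64_t)INT32_MAX * SECTOR_SIZE;
}
```

A filled structure would then be passed by address as the arg of a VOL_READ or VOL_WRITE ioctl on the volume device.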
The ATOMIC_COPY, VERIFY_READ, and VERIFY_WRITE ioctls, described later in this section, perform special I/O operations against the volume.  They use the vol_io structure to initiate I/O requests and receive the status information back.  The members of the vol_io structure are:
	voff_t    vi_offset;             /* 0x00 offset on plex */
	size_t    vi_len;                /* 0x04 amount of data to read/write */
	caddr_t   vi_buf;                /* 0x08 ptr to buffer */
	size_t    vi_nsrcplex;           /* 0x0c number of source plexes */
	size_t    vi_ndestplex;          /* 0x10 number of destination plexes */
	struct    plx_ent *vi_plexptr;   /* 0x14 ptr to array of plex entries */
	ulong_t   vi_flag;               /* 0x18 flags associated with op */
The members of the plx_ent structure are: 
	char      pe_name[Name_SZ];      /* name of plex */
	int       pe_errno;              /* error number against plex */
The vi_offset value specifies the sector offset of the I/O within the volume.  It must be within the address range of the volume. Also, the entire range of the I/O from vi_offset to vi_offset + vi_len must be within the address range of the volume. 
The vi_len field specifies the length of the I/O in sectors. It must be between 0 and 120 sectors (VOL_MAXSPECIALIO). 
The vi_buf field is a pointer to a buffer of vi_len sectors. The VERIFY_WRITE ioctl writes the data stored in this buffer. 
The vi_nsrcplex field is the number of source plexes available for the operation and the vi_ndestplex field is the number of destination plexes available for the operation.  The vi_nsrcplex and vi_ndestplex values must be between 0 and 8 (PLEX_NUM). 
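The limits described so far can be checked before the ioctl is issued. The sketch below is a hypothetical user-side pre-check, not the kernel's own validation: the struct vol_io_check, the function name, and the choice to treat an offset equal to the volume size as out of range are all assumptions for the example. Only the values 120 and 8 come from the text.

```c
#include <stddef.h>

#define VOL_MAXSPECIALIO 120  /* max sectors per special I/O (from the text) */
#define PLEX_NUM         8    /* max plexes per volume (from the text) */

/* Reduced stand-in for vol_io: only the fields that are range-checked. */
struct vol_io_check {
    long   vi_offset;     /* sector offset within the volume */
    size_t vi_len;        /* length in sectors */
    size_t vi_nsrcplex;   /* number of source plexes */
    size_t vi_ndestplex;  /* number of destination plexes */
};

/* Return 1 if the request satisfies the documented limits for a volume
 * of vol_sectors sectors, 0 otherwise. */
int vol_io_in_range(const struct vol_io_check *io, unsigned long vol_sectors)
{
    if (io->vi_len > VOL_MAXSPECIALIO)
        return 0;
    if (io->vi_nsrcplex > PLEX_NUM || io->vi_ndestplex > PLEX_NUM)
        return 0;
    /* Assumption: the starting offset must name an existing sector. */
    if (io->vi_offset < 0 || (unsigned long)io->vi_offset >= vol_sectors)
        return 0;
    /* The whole span from vi_offset to vi_offset + vi_len must fit. */
    if (io->vi_len > vol_sectors - (unsigned long)io->vi_offset)
        return 0;
    return 1;
}
```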
The vi_plexptr is a pointer to an array of plx_ent structures.  The first vi_nsrcplex entries in the array are source plexes.  The pe_name contains the name of the plex.  If the name of the first source entry is the null string, then the kernel selects all plexes available for reading as part of the volume and fills in the pe_name fields. 
After the source plexes, the next vi_ndestplex entries are the destination plexes.  If the name of the first destination entry is the null string, then the kernel selects all plexes available for writing as part of the volume and fills in the pe_name fields. 
After the I/O operations are performed, the plx_ent structures are copied back to the user.  If the kernel selected the plexes, the names of the selected plexes are in the pe_name fields.  The status of the operations on each plex is stored in the pe_errno field of the plx_ent structure for that plex. 
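The array layout described above (source entries first, destination entries after them, with an empty first name meaning "let the kernel choose") might be prepared as in the following sketch. The struct plx_ent_sketch, the constant NAME_SZ_SKETCH with its value of 32, and build_plex_list() are hypothetical stand-ins; the real Name_SZ and plx_ent come from the system header.

```c
#include <stddef.h>
#include <string.h>

#define NAME_SZ_SKETCH 32  /* hypothetical; not the real Name_SZ */

/* Local stand-in for plx_ent. */
struct plx_ent_sketch {
    char pe_name[NAME_SZ_SKETCH];  /* name of plex */
    int  pe_errno;                 /* error number against plex */
};

/* Lay out nsrc source names followed by ndest destination names in ents[].
 * Passing NULL for src (or dest) leaves those entries with empty names,
 * which asks the kernel to select the plexes and fill in the names.
 * Returns the total number of entries, or -1 if a name is too long. */
int build_plex_list(struct plx_ent_sketch *ents,
                    const char *const *src, size_t nsrc,
                    const char *const *dest, size_t ndest)
{
    size_t i;

    memset(ents, 0, (nsrc + ndest) * sizeof(*ents));
    for (i = 0; src != NULL && i < nsrc; i++) {
        if (strlen(src[i]) >= NAME_SZ_SKETCH)
            return -1;
        strcpy(ents[i].pe_name, src[i]);
    }
    for (i = 0; dest != NULL && i < ndest; i++) {
        if (strlen(dest[i]) >= NAME_SZ_SKETCH)
            return -1;
        strcpy(ents[nsrc + i].pe_name, dest[i]);
    }
    return (int)(nsrc + ndest);
}
```

The caller must supply an array with room for nsrc + ndest entries, and would set vi_plexptr to it along with the matching vi_nsrcplex and vi_ndestplex counts.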
The pe_errno field is 0 if the operation succeeded against the plex.  If pe_errno isn't 0, then the error code indicates what happened to the plex.  The possible values for pe_errno are: 
- ENOENT 
- The specified plex isn't associated with the volume the ioctl was issued against. 
 
- EACCES 
- The specified plex is in the disabled state, so no I/O can be performed against it. 
 
- ENXIO 
- An error was detected in the operation, so no I/O operation was attempted to the plex.  If one plx_ent 
structure in a list contains a bad name, then no I/O is done.  All plx_ent structures with a 
valid name have their pe_errno set to ENXIO to indicate no I/O was attempted. 
 
- EFAULT 
- A source plex is sparse and doesn't have blocks that map the entire I/O request. 
 
- EIO 
- A read I/O error was returned against a source plex. 
 
- EROFS 
- A write I/O error was returned against a destination plex. 
 
- ESRCH 
- The VERIFY_READ and VERIFY_WRITE operations compare the data from different plexes 
against each other to verify the consistency.  If the comparison detects an error, the plex 
that was read first is considered correct.  An ESRCH error is returned against the plex that 
was read second to indicate it contains bad data. 
 
For the ATOMIC_COPY, VERIFY_READ, and VERIFY_WRITE ioctls, if the entire operation is a success, then a 0 is returned.  If there is a fatal error, a -1 is returned and the external variable errno indicates the reason for failure.  If a 1 is returned, the operation failed against at least one plex, and the net result must be determined by examining the pe_errno fields of all the plexes.
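A caller might interpret these three return values and the per-plex codes along the following lines. This is a sketch under stated assumptions: the struct, the enum, and the policy that a partial result is "usable" when at least one source and one destination report no error are inventions for the example, not rules from this manual page.

```c
#include <errno.h>
#include <stddef.h>

/* Local stand-in for plx_ent; the name size is hypothetical. */
struct plx_ent_sketch {
    char pe_name[32];
    int  pe_errno;
};

enum io_outcome { IO_OK, IO_FATAL, IO_PARTIAL_USABLE, IO_PARTIAL_FAILED };

/* Classify the result of a special I/O ioctl. rc is the ioctl return
 * value; the first nsrc entries are sources, the next ndest destinations.
 * Assumed policy: a partial result is usable if at least one source and
 * at least one destination have pe_errno == 0. */
enum io_outcome classify(int rc, const struct plx_ent_sketch *ents,
                         size_t nsrc, size_t ndest)
{
    size_t i, good_src = 0, good_dest = 0;

    if (rc == 0)
        return IO_OK;
    if (rc < 0)
        return IO_FATAL;      /* consult errno for the reason */
    for (i = 0; i < nsrc; i++)
        if (ents[i].pe_errno == 0)
            good_src++;
    for (i = 0; i < ndest; i++)
        if (ents[nsrc + i].pe_errno == 0)
            good_dest++;
    return (good_src > 0 && good_dest > 0) ? IO_PARTIAL_USABLE
                                           : IO_PARTIAL_FAILED;
}

/* Map a pe_errno value to the meaning listed above. */
const char *pe_errno_text(int e)
{
    switch (e) {
    case 0:      return "ok";
    case ENOENT: return "plex not associated with the volume";
    case EACCES: return "plex is disabled";
    case ENXIO:  return "no I/O attempted";
    case EFAULT: return "sparse source plex does not cover the request";
    case EIO:    return "read error on source plex";
    case EROFS:  return "write error on destination plex";
    case ESRCH:  return "data mismatch against an earlier plex";
    default:     return "unknown";
    }
}
```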
The ioctls that do volume-special I/O are: 
- ATOMIC_COPY 
- This ioctl takes a pointer to a vol_io structure as an argument. It reads vi_len sectors of data, at offset 
vi_offset, from one of the plexes specified by the first vi_nsrcplex plex entries into a 
buffer.  Then it writes the contents of the buffer onto all the plexes specified by the next 
vi_ndestplex plex entries at the same offset.  This entire operation is atomic with respect 
to the I/O stream of the volume.  The vi_buf field is unused and should be NULL. 
 
- If the first source plex entry has a null string for the name, the kernel selects from any plex of the 
volume that is enabled for read access.  The names of any selected plexes are copied into 
the appropriate plex entries. 
 
- If the first destination plex entry has a null string for the name, the kernel will write to all plexes of 
the volume that are enabled for write access.  The names of selected plexes are copied into 
the appropriate plex entries. 
 
- When the list of source plexes has been compiled, the kernel tries to read each plex in order.  A plex 
can't be read if it doesn't have backup storage covering the entire operation.  Once a plex 
has been successfully read, all the destination plexes are written.  The writes can succeed 
even if the destination plex is sparse and doesn't have backup storage to cover the entire 
write. 
 
- If the ioctl returns a value of -1, then an error occurred that prevented the 
ATOMIC_COPY from working.  If the return value is 0, then the operation succeeded.  If 
the return value is 1, then the pe_errno field must be examined to determine the errors on 
each individual plex.  The status of the overall operation depends on these individual errors. 
 
- VERIFY_READ 
- This command accepts a pointer to a vol_io structure as an argument.  It reads vi_len sectors, from 
offset vi_offset, on the first vi_nsrcplex plexes specified by the vi_plexptr array of plex 
entries.  It compares the data from each plex. If any plex is different from the previous 
plexes read, the pe_errno value for that plex is set to ESRCH. This entire operation is 
atomic with respect to the I/O stream of the volume. 
 
- The vi_ndestplex field must be 0.  If the vi_buf field is not NULL, then the data read from the 
plexes not marked with ESRCH is copied into the buffer specified by vi_buf. 
 
- As each plex is read, any data that has already been read is compared against the previous reads.  If 
all the data for the plex passes, then any data that was read for the first time is copied into 
the comparison buffer.  This allows VERIFY_READ operations to work if the volume 
contains sparse plexes.  The data not represented by backup storage is not compared against 
anything. 
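The comparison procedure described for VERIFY_READ can be modeled in user space as below. This is a toy simulation, not driver code: one byte stands in for a sector, and struct toy_plex, NSECT, and toy_verify_read() are names invented for the example.

```c
#include <errno.h>
#include <stddef.h>

#define NSECT 4  /* sectors per request in this toy model */

struct toy_plex {
    unsigned char data[NSECT];  /* one byte stands in for one sector */
    unsigned char have[NSECT];  /* 1 if a subdisk backs this sector */
    int           pe_errno;     /* result for this plex */
};

/* Read the plexes in order. Sectors already present in the comparison
 * buffer are compared; if every comparison passes, sectors seen for the
 * first time are copied in. Holes are never compared. A plex that
 * differs is flagged ESRCH and contributes no data.
 * Returns the number of plexes flagged ESRCH. */
int toy_verify_read(struct toy_plex *px, size_t n,
                    unsigned char *out, unsigned char *filled)
{
    size_t p, s;
    int bad = 0;

    for (s = 0; s < NSECT; s++)
        filled[s] = 0;
    for (p = 0; p < n; p++) {
        int mismatch = 0;

        for (s = 0; s < NSECT; s++)
            if (px[p].have[s] && filled[s] && px[p].data[s] != out[s])
                mismatch = 1;
        if (mismatch) {
            px[p].pe_errno = ESRCH;
            bad++;
            continue;
        }
        px[p].pe_errno = 0;
        for (s = 0; s < NSECT; s++)
            if (px[p].have[s] && !filled[s]) {
                out[s] = px[p].data[s];
                filled[s] = 1;
            }
    }
    return bad;
}
```

With two sparse plexes that together cover the request and a third full plex that disagrees in one sector, only the third plex is flagged, and the buffer holds the data assembled from the first two.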
 
- VERIFY_WRITE 
- This command accepts a pointer to a vol_io structure as an argument.  It writes vi_len sectors to 
offset vi_offset on the first vi_ndestplex plexes specified by the vi_plexptr array of plex 
entries.  The data to be written is stored in the buffer pointed to by vi_buf.  Then, the data 
is read back from each plex that was successfully written and is compared against the data 
written.  If the data from any plex doesn't match the data written, the pe_errno value for 
the plex is set to ESRCH. This entire operation is atomic with respect to the I/O stream of 
the volume. 
 
- The vi_nsrcplex field must be 0. 
 
- As each plex is read, any data that doesn't have backup storage on that plex is filled in from the write 
buffer.  This allows VERIFY_WRITE operations to work if the volume contains sparse 
plexes.  The comparison for data not represented by backup storage always succeeds. 
 
- VOL_LOG_WRITE 
- This command takes no argument, and causes the log for a volume to be written to disk immediately. 
This command is useful for making sure that the on-disk images of the log have been 
written. 
 
- This command returns -1 with errno set to EINVAL if the specified volume does not have logging 
enabled. 
 
- PLEX_DETACH 
- This command allows an enabled plex to be detached. The argument is the name of the plex to 
detach. 
 
 Diagnostics
The following errors are returned by the volume and plex virtual disk device interfaces: 
- ENXIO 
- A validation error occurred during an I/O operation.  If this error is returned, the driver attempted no 
I/O. 
 
- EIO 
- A physical I/O error occurred during an operation.  If this error is returned, the driver tried an I/O 
and it failed. 
 
- EAGAIN 
- A needed kernel resource couldn't be obtained. 
 
- EFAULT 
- A pointer passed to the kernel was invalid, causing a bad memory reference. 
 
- EINVAL 
- Invalid data was passed to the kernel.  Some field in a vol_io structure failed a sanity check. 
 
- EBADF 
- An attempt was made to write a volume that wasn't opened for writing or read a volume that wasn't 
open for reading. 
 
- ENOENT 
- An object named in a GET_VOL_STATS ioctl was not associated with the volume. 
 
- ENOENT 
- The kernel was asked to supply a list of destination plexes for an ATOMIC_COPY or 
VERIFY_WRITE ioctl.  If no enabled plexes available in write mode are found, an 
ENOENT error is returned. 
 
- ESRCH 
- The kernel was asked to supply a list of source plexes for an ATOMIC_COPY or VERIFY_READ 
ioctl.  If no enabled plexes available in read mode are found, an ESRCH error is returned. 
 
- ENFILE 
- The kernel was asked to supply a list of source plexes for an ATOMIC_COPY or VERIFY_READ 
ioctl.  If more than vi_nsrcplex enabled plexes available in read mode are found, an 
ENFILE error is returned. 
 
- EMFILE 
- The kernel was asked to supply a list of destination plexes for an ATOMIC_COPY or 
VERIFY_WRITE ioctl.  If more than vi_ndestplex enabled plexes available in write 
mode are found, an EMFILE error is returned. 
 
 Files
- /dev/vx/dsk* 
- Volume block device files.
 
- /dev/vx/rdsk* 
- Volume character (raw) device files.
 
 References
fsck_vxfs(ADM),
ioctl(S),
vxconfig(HW),
vxiod(HW),
vxtrace(HW)
 
Copyright © 2005 The SCO Group, Inc. All rights reserved.