Driver testing and debugging

Common driver problems

Following is a discussion of some common driver bugs, with possible symptoms. These should be used only as suggestions. Each driver is unique and will have unique bugs.

Coding problems

Simple coding problems usually show up when you try to compile the driver. In general, these are similar to coding problems for any C program, such as failure to #include necessary header files, define all data structures, or properly delineate comment lines. Specific coding errors unique to driver code include the following:

Memory-mapped device registers must be declared volatile so the compiler knows the values may change outside of program control. Otherwise, it may cache the values in local registers or memory and not see changes in hardware state.
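
As a minimal sketch of the point above, the following code accesses a register block through a volatile-qualified pointer. The dev_regs structure, its field names, and the issue_command( ) function are hypothetical and exist only for illustration; a real driver would take the layout from the hardware documentation and obtain the mapped address from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block; the layout is illustrative only. */
struct dev_regs {
    uint32_t command;   /* written by the driver */
    uint32_t status;    /* updated by the hardware */
};

/*
 * Write a command and read back the status register.
 * Because regs is volatile-qualified, the compiler must emit the
 * store and perform a fresh load; it may not reuse a cached value.
 */
uint32_t issue_command(volatile struct dev_regs *regs, uint32_t cmd)
{
    assert(regs != NULL);
    regs->command = cmd;
    return regs->status;
}
```

Without the volatile qualifier, an optimizing compiler would be free to keep a previously read status value in a CPU register and never look at the device again.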

Boot problems

Boot problems refer to problems that prevent a system boot with your device configured. If the system won't boot, first try to boot it without the driver to verify that the driver is the problem. Some driver problems that prevent a system boot include:

Cannot call open() routine (DDI)

If the driver loads but fails when its open(D2) routine is called, this indicates that the kernel cannot access the driver.

For DDI drivers, compare the ``major'' number displayed by the ls -l and modadmin -S commands. A mismatch may occur if you try to configure the kernel without using the idinstall command, or if you try to force a major number, either by setting a major number other than 0 in the Master(DSP/4dsp) file or by using the option to idinstall that preserves the previous major number.

This can also be caused when an entry point routine that is called before open( ) does not return 0. To detect this, run the following command:

   truss cat /dev/drv_node
where drv_node is a node for the driver. Do not use redirection for the command, although you may want to redirect the stderr output to a file. Look at the result of the open( ) system call. If it fails with errno set to ENODEV (19), check the devinfo(D2) entry point routine and the CFG_ADD or CFG_VERIFY subfunctions to config(D2) to be sure that they are returning 0 when they should.
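
If truss is not convenient, the same check can be approximated with a small C routine that calls open( ) and reports errno. This is only a sketch: probe_open( ) is a hypothetical helper name, and the path you pass would be the node for your own driver.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open a device node read-only.
 * Returns 0 if open() succeeded, otherwise the errno value that
 * open() set (for example ENODEV).
 */
int probe_open(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

A return value of ENODEV from probe_open("/dev/drv_node") would correspond to the failure described above.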

You can also use kdb in the following way:

  1. Set a breakpoint at the config( ) and devinfo( ) entry point routines.

  2. Continue operation.

  3. From the shell, load the driver, which will hit the breakpoint set in Step #1.

  4. From kdb, singlestep through the driver (for example, use 100 SS) until the config( ) or devinfo( ) routine returns to its caller.

  5. Examine the value of the EAX register at that instruction. If it is non-0, the driver is not returning 0.

Data structure problems

A driver can corrupt the kernel data structures. If the driver is setting or clearing the wrong bits in a device register, a write operation may put bad data on the device and a read operation may put bad data anywhere in the kernel. Such errors may affect other drivers on the system.

Finding this bug involves painstaking walk-throughs of the code. Look for a place where a pointer is freed (or never set) before the driver tries to use it, or places where the code forgets to check a flag before accessing a certain structure. Hardware debugging techniques such as logic analyzers, emulator pods, or oscilloscopes can help determine where and when the memory accesses occurred.

Hardware timing errors

Timing errors occur when the driver code executes too quickly or too slowly for the device being driven. For instance, the driver might read a status register on a device too soon after sending the device a command. The device may not have had time to update the status register, so the status register is perceived by the driver to be all 0 bits when, in fact, the device may just be slow in posting the correct status register setting.
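
One common way to guard against reading the status register too early is to poll it a bounded number of times with a short delay between reads. The sketch below assumes a hypothetical STATUS_DONE bit and takes the delay routine as a parameter; in a driver, the delay would come from whatever delay or busy-wait routine your DDI version documents. The fake_* names simulate a slow device so the code can be exercised in user space.

```c
#include <stdint.h>

#define STATUS_DONE 0x80u   /* hypothetical "command complete" bit */

/* Stand-in type for a kernel delay routine. */
typedef void (*delay_fn)(unsigned usec);

/*
 * Poll the status register until STATUS_DONE is set or the retry
 * budget is used up. Returns 0 on success, -1 on timeout.
 */
int wait_for_done(volatile uint32_t *status, unsigned retries,
                  unsigned usec_per_try, delay_fn delay)
{
    while (retries-- > 0) {
        if (*status & STATUS_DONE)
            return 0;
        delay(usec_per_try);
    }
    return -1;
}

/* Simulated device: posts STATUS_DONE after fake_ticks_left delays. */
uint32_t fake_status;
unsigned fake_ticks_left;

void fake_delay(unsigned usec)
{
    (void)usec;
    if (fake_ticks_left > 0 && --fake_ticks_left == 0)
        fake_status |= STATUS_DONE;
}
```

Returning an error on timeout, rather than treating an all-zero status as valid, lets the caller distinguish a slow device from a failed command.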

When testing the driver, it is useful to verify that a simple, single interrupt is being handled properly. After this is confirmed, you should check that the interrupt handler can handle a number of interrupts that happen at almost the same time.

Corrupted interrupt stack

If a driver's interrupt handler runs at an execution level lower than the corresponding IPL for the device, the processing of one interrupt may be interrupted by a second interrupt from the same device. This will seriously corrupt the interrupt stack, which may cause the system to panic with a stack fault or kernel address fault. Sometimes, however, it will only cause random operational irregularities, which can make this a difficult problem to detect. You can identify this problem by looking at the interrupt stack in the system dump. If it is corrupted, check the execution level of the driver's interrupt handling routine.

System configuration problems

When debugging a driver, it is wise to check the system configuration carefully, especially for a driver that runs fine on one system and fails on another system.

Devices on busses such as ISA that do not self-configure require some manual configuration, which can lead to errors even for devices on self-configuring busses on the system. For example, if an ISA device is assigned to a specific IRQ or memory address but the software configuration does not properly reflect this, the PCI BIOS may assign another device to the same resource. This can cause various errors such as interrupts that are never received despite a successful call to the cm_intr_attach(D3) (DDI) or idistributed(D3oddi) (ODDI) function.

Accessing critical data

Be sure that all critical code sections are protected with appropriate synchronization primitives. See ``Critical code section'' and ``Synchronization primitives''.

Overuse of local driver storage

If the driver routines use large amounts of local (automatic) storage, such as large arrays or large numbers of arrays, they may exceed the bounds of the kernel stack or the interrupt stack, which in turn will panic the system. 200 bytes per routine is probably excessive. Allocate such storage with kmem_alloc( ) and similar kernel functions instead.

The following script can identify such occurrences. This script depends on the specific output format of the dis command and specific behaviors of the compiler, but works for SVR5, SCO OpenServer 5, and SCO SVR5 2.X.

   cd /etc/conf/pack.d
   find . -name '*.[oa]' | while read f; do
     echo file $f
     dis $f
   done 2>/dev/null | awk '
     /\(\)$/      { fun = $1 }
     $1 == "file" { file = $2 }
     /subl.*esp$/ {
                    sub(/\$0x/, "0x", $NF)
                    sub(/,.*/, "", $NF)
                    print $NF " " file " " fun " "
     } '
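
Once a routine has been flagged, one way to reduce its stack use is to move the large buffer from automatic storage to allocated memory. The sketch below is a user-space illustration only: buf_alloc( ) and buf_free( ) are hypothetical stand-ins built on malloc( ) and free( ), and in a driver they would be replaced by kmem_alloc( ) and its matching free routine, with arguments taken from the manual pages. XFER_BUF_SIZE and copy_and_sum( ) are likewise invented for the example.

```c
#include <stdlib.h>
#include <string.h>

#define XFER_BUF_SIZE 4096   /* hypothetical transfer buffer size */

/* User-space stand-ins for the kernel allocator. */
void *buf_alloc(size_t n) { return malloc(n); }
void buf_free(void *p, size_t n) { (void)n; free(p); }

/*
 * Copy len bytes into a scratch buffer and sum them.
 * The scratch buffer is allocated rather than declared as
 * "unsigned char buf[XFER_BUF_SIZE];" on the stack.
 * Returns 0 on success, -1 if len is too large or allocation fails.
 */
int copy_and_sum(const unsigned char *src, size_t len, unsigned long *sum)
{
    unsigned char *buf;
    size_t i;

    if (len > XFER_BUF_SIZE)
        return -1;
    buf = buf_alloc(XFER_BUF_SIZE);
    if (buf == NULL)
        return -1;

    memcpy(buf, src, len);
    *sum = 0;
    for (i = 0; i < len; i++)
        *sum += buf[i];

    buf_free(buf, XFER_BUF_SIZE);
    return 0;
}
```

The routine now uses only a few words of stack regardless of the buffer size, at the cost of having to handle allocation failure.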

Incorrect DMA address mapping

Failure to set up address mapping correctly for DMA transfers is another common mistake. On a read operation, a bad address map may cause data to be placed in the wrong location in memory, overwriting whatever is there including, for example, a portion of the operating system code.

Unregistered interrupt handlers in loadable DDI drivers

A loadable DDI driver must register its interrupt handler by calling the cm_intr_attach(D3) function (or, for HBA drivers, the sdi_intr_attach(D3sdi) function) before the interrupts are enabled on the device. The driver must also allocate and initialize all resources that are needed to service the interrupt. For DDI 8 drivers, this is done in the CFG_ADD subfunction to the config(D2) entry point routine; loadable drivers written for earlier DDI releases attach interrupts in their _load(D2) entry point routines; see ``Interrupt handlers, attaching and registering''.

If a driver enables interrupts on a board before it registers its interrupt handler, it creates a race condition that can hang the system. The probability of the race condition occurring increases if the board whose interrupts are just enabled uses level-triggered interrupts and shares its IRQ with another board that uses level-triggered interrupts.

To understand what happens, consider driver zzz, which shares an IRQ with an existing driver called yyy and the following sequence of events:

  1. The yyy driver is properly initialized, its interrupt routine is registered, and its board's interrupts are enabled at IRQX.

  2. The zzz driver is properly initialized and its board's interrupts are enabled at IRQX but the interrupt handler is not registered.

  3. IRQX is raised on a device controlled by the zzz driver and, being level triggered, remains high.

  4. The kernel's interrupt-handling code calls each of the registered interrupt handlers, currently only those of the yyy driver.

  5. The yyy driver's interrupt-handling code checks to see if one of its boards generated the interrupt. It did not, so the yyy driver simply returns.

  6. The kernel believes that the interrupt must have been handled and so returns out of interrupt mode. However, the device controlled by the zzz driver is keeping the IRQ high, so steps 4 through 6 repeat themselves forever.

Setting breakpoints in the intr(D2) entry point routine of both the zzz and yyy drivers and then singlestepping through the execution will show the situation: the yyyintr( ) routine is entered repeatedly and the zzzintr( ) routine is never entered.

To solve the problem, fix the zzz driver code to ensure that the cm_intr_attach(D3) function is called to register the interrupt handler before it sends commands to the devices that enable the board's interrupts.
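
The hang can be modeled in user space to make the sequence concrete. The following is a simplified simulation, not kernel code: struct irq_line, irq_attach( ), and dispatch( ) are hypothetical stand-ins for the kernel's shared-interrupt bookkeeping, and the dispatch loop is capped at max_passes so that the "forever" case terminates.

```c
#include <stddef.h>

#define MAX_HANDLERS 4

typedef void (*intr_handler)(void);

/* Hypothetical model of one shared, level-triggered IRQ. */
struct irq_line {
    intr_handler handlers[MAX_HANDLERS];
    size_t count;
};

/* Nonzero while the corresponding board holds the line high. */
int yyy_pending, zzz_pending;

/* Each handler services only its own board. */
void yyy_intr(void) { if (yyy_pending) yyy_pending = 0; }
void zzz_intr(void) { if (zzz_pending) zzz_pending = 0; }

void irq_attach(struct irq_line *l, intr_handler h)
{
    if (l->count < MAX_HANDLERS)
        l->handlers[l->count++] = h;
}

int line_asserted(void) { return yyy_pending || zzz_pending; }

/*
 * Call every registered handler while the line stays high.
 * Returns the number of passes needed to clear the line, or -1 if
 * it is still high after max_passes (the simulated hang).
 */
int dispatch(struct irq_line *l, int max_passes)
{
    int pass;
    size_t i;

    for (pass = 0; pass < max_passes; pass++) {
        if (!line_asserted())
            return pass;
        for (i = 0; i < l->count; i++)
            l->handlers[i]();
    }
    return line_asserted() ? -1 : pass;
}
```

With only yyy_intr( ) attached and zzz_pending set, dispatch( ) never clears the line; once zzz_intr( ) is attached as well, a single pass is enough.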

Unregistered interrupt handlers in ODDI drivers

Non-SCSI SCO OpenServer 5 drivers must use the idistributed(D3oddi) function to register interrupt handlers even for single-threaded drivers. SCSI drivers must use the Sharegister(D3osdi) function with the shareg_ex(D4osdi) structure. Drivers that use the older add_intr_handler(D3oddi) function or the shareg structure on SCO OpenServer 5 systems may not receive interrupts from the device.

Panic on mod_shr_intn() (DDI)

DDI drivers that do not call cm_intr_detach(D3) before being unloaded can cause intermittent panics on mod_shr_intn( ), an internal routine that manages shared interrupts.

The problem may not manifest itself after a reboot and first use of the driver; the device is closed and no interrupts are generated. However, if the module is unloaded and then loaded again and the device is accessed, an interrupt occurs that causes the panic. Because cm_intr_detach( ) was not called when the driver was unloaded, the pointer to the (now unloaded) interrupt routine is left in the kernel's interrupt list for that IRQ. When the driver is loaded again, cm_intr_attach( ) registers the interrupt handler again, but now it appears that the IRQ is shared because the old one is still there. When the device is accessed and an interrupt is generated, the kernel tries to call all the interrupt handlers that are registered, including the uncancelled one that points to a now-invalid address.

Be sure that any path for unloading the driver does detach the interrupt handlers. For DDI 8 drivers, this generally means calling the cm_intr_detach( ) function from the CFG_REMOVE subfunction to the config( ) entry point routine, and calling CFG_REMOVE from the driver's _unload(D2) entry point.
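
The effect of a missing detach can be illustrated with a small user-space model of the kernel's handler list. Everything here is hypothetical (struct irq_list, list_attach( ), list_detach( ), drv_load( ), and drv_unload( ) are invented for the sketch); it only shows how a load, unload, load cycle leaves a stale entry behind when the unload path skips the detach.

```c
#include <stddef.h>

#define MAX_HANDLERS 4

typedef void (*intr_handler)(void);

/* Hypothetical model of the kernel's handler list for one IRQ. */
struct irq_list {
    intr_handler h[MAX_HANDLERS];
    size_t n;
};

void list_attach(struct irq_list *l, intr_handler f)
{
    if (l->n < MAX_HANDLERS)
        l->h[l->n++] = f;
}

/* Remove the first entry that matches f, if any. */
void list_detach(struct irq_list *l, intr_handler f)
{
    size_t i;

    for (i = 0; i < l->n; i++) {
        if (l->h[i] == f) {
            l->n--;
            l->h[i] = l->h[l->n];   /* fill the hole with the last entry */
            return;
        }
    }
}

void drv_intr(void) { }

/* Load path: always attaches the handler. */
void drv_load(struct irq_list *l) { list_attach(l, drv_intr); }

/* Unload path: detaches only when do_detach is nonzero. */
void drv_unload(struct irq_list *l, int do_detach)
{
    if (do_detach)
        list_detach(l, drv_intr);
}
```

In the model the stale entry still points at a valid function; in the kernel it would point into the unloaded module's freed text, which is what produces the panic.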

Calling functions at wrong context

System hangs when testing a driver are frequently caused by a driver calling a function from the wrong context. Each entry point executes in a specific context, and many functions can only be called from certain contexts, all identified on the manual pages and discussed in the ``Context of a driver'' article.

The most common error is calling a function that blocks from a context that cannot block. Some common manifestations of this include:

Finding the offending call can be challenging, since the system may lock up enough that you are unable to call the debugger. If you can get into the debugger, try the following to identify the offending code:

From SVR5 kdb

From SCO OpenServer 5 scodb

If you cannot break into the debugger, you will need to power-cycle the system. In this situation, set up a remote serial debugging console (see ``Setting up a remote console for debugging''), trigger the problem again, then break into the debugger.

© 2005 The SCO Group, Inc. All rights reserved.
OpenServer 6 and UnixWare (SVR5) HDK - June 2005