Automating frequent tasks

How to control program performance

As mentioned earlier, in any shell script, 90% of the computational load is imposed by about 10% of the script. The bottlenecks to look out for are as follows:

Loops, especially the main program loop. A process which is called repeatedly imposes a heavy load on the computer. Most shell script loops are extremely heavy users of computer resources because they execute programs several times in rapid succession.
File access (and reads and writes directed through named pipes). Because the computer's hard disk is several orders of magnitude slower than its memory, any procedure that involves heavy disk I/O will invariably impose a heavy load on the system.
Processes. Many commands are built into the shell, but those which are not require the system to load and execute a program. This has two consequences: a disk access is required, and an additional process is run (diverting resources from any other processes which are being executed concurrently).
Size of data. As the files being processed by a filter grow longer, all processes involving those files take longer. However, the relationship between file size and time is not fixed; one big file may take much longer to process than several small files containing the same total amount of information.

To improve the performance of a shell script, you need to be constantly aware of these considerations. Any activity that takes place in a main loop is likely to yield a big performance improvement if you can find a way to reduce the amount of disk I/O or the number of processes it requires. Activities that require a large data file may be sped up by switching to several smaller files, if possible. (A small file is one that is less than eight or ten kilobytes long; for technical reasons such files can be opened and scanned more rapidly than larger files.)
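As a minimal sketch of reducing the number of processes inside a loop, the two functions below count up to a limit. The function names are hypothetical, and the second one assumes a shell that supports the `$(( ))` arithmetic expansion. The first runs the external expr program on every pass; the second does the same arithmetic inside the shell.

```shell
# Count up to the limit given as $1, running expr once per pass.
count_with_expr() {
    i=0
    while [ "$i" -lt "$1" ]; do
        i=`expr "$i" + 1`    # loads and runs an external program each time
    done
    echo "$i"
}

# The same loop, using arithmetic evaluated by the shell itself.
count_with_builtin() {
    i=0
    while [ "$i" -lt "$1" ]; do
        i=$((i + 1))         # no new process is created
    done
    echo "$i"
}

count_with_expr 200
count_with_builtin 200
```

Both calls print the same result. Timing each function with a larger limit should show how much of the cost comes from starting expr; the actual figures depend on the system.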

The standard development cycle, which should be applied to shell procedures as to other programs, is to write code, get it working, thoroughly test it, measure it, and optimize the important parts (outlined above), looping back to earlier stages wherever necessary. The time(C) command is a useful tool for optimizing shell scripts. time is used to establish how long a command took to execute:

   $ time ls
   real	0m0.06s
   user	0m0.03s
   sys	0m0.03s

The values reported by time are the elapsed time during the command (the real time); the time spent processing the command itself (the user time); and the time the system took to execute the system calls within the command (the sys time). In practice, only the first value, the real time, is relevant at this level. Note that this is the output from the Korn shell's built-in time command; the Bourne shell output may vary. (If you have the Development System, the timex(ADM) command offers additional facilities.)

Because the SCO OpenServer system is multi-tasking, it is impossible to accurately judge how long a program is taking to run by any other means; a seemingly slow process may be the result of an unusually heavy load being placed on the computer by some other user or process. Each timing test should be run several times, because the results are easily disturbed by variations in system load.
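One way to repeat a timing test is sketched below. The helper time_runs is a hypothetical name, and the sketch assumes a shell in which time is built in (such as the Korn shell), so that it can be used inside a function. It runs the given command the requested number of times and reports the timing of each run separately, so that an unusually slow run stands out.

```shell
# time_runs COUNT COMMAND [ARGS...]
# Run COMMAND under time COUNT times, discarding the command's own output.
time_runs() {
    runs=$1
    shift
    pass=1
    while [ "$pass" -le "$runs" ]; do
        echo "run $pass:"
        time "$@" >/dev/null     # timing figures go to standard error
        pass=$((pass + 1))
    done
}

time_runs 3 ls
```

If the real times of the runs differ widely, the system load was probably changing during the test, and the measurements should be repeated.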

A useful technique is to encapsulate the body of a loop within a function, so that the sole activity within the loop is to call that function; you can then time the function, and time the loop as a whole. Alternatively, you can time individual steps in the process to see which of them are taking longest.
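The following sketch shows a loop body moved into a function. The names process_file and process_all are hypothetical, counting lines stands in for whatever work the loop really does, and /etc/passwd and /etc/group are used only as sample files. It assumes a shell with a built-in time, which can time a function as well as a program.

```shell
# Loop body: count the lines in a single file.
process_file() {
    wc -l < "$1"
}

# The loop does nothing except call the function for each argument.
process_all() {
    for f in "$@"; do
        process_file "$f"
    done
}

# Time one pass through the body, then the loop as a whole.
time process_file /etc/passwd
time process_all /etc/passwd /etc/group
```

Comparing the two sets of figures gives a rough idea of how much of the total time is spent in each pass through the body.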