(gmp.info.gz) Assembler Writing Guide
Writing Guide
-------------
This is a guide to writing software pipelined loops for processing limb
vectors in assembler.
First determine the algorithm and which instructions are needed.
Code it without unrolling or scheduling, to make sure it works. On a
3-operand CPU try to write each new value to a new register; this will
greatly simplify later steps.
Then note for each instruction the functional unit and/or issue port
requirements. If an instruction can use either of two units, such as U0
or U1, then make a category "U0/U1". Count the instructions using each
unit (or combined unit), and count all instructions.
Figure out from those counts the best possible loop time. The goal
will be to find a perfect schedule where instruction latencies are
completely hidden. The total instruction count might be the limiting
factor, or perhaps a particular functional unit. It might be possible
to tweak the instructions to help the limiting factor.
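The counting step above can be sketched as a small calculation. This is
only an illustration, not part of GMP: the function name, the unit names
"LS", "U0" and "U1", the instruction mix and the issue width of 4 are all
assumptions, and each unit is assumed to accept one instruction per cycle.
Latencies are ignored, since the aim is a schedule that hides them.

```python
from itertools import combinations
from math import ceil

def min_loop_cycles(unit_options, issue_width):
    """Lower bound on cycles per loop iteration (hypothetical helper).

    unit_options: one entry per instruction, each a set of unit names
                  the instruction may issue on, e.g. {"U0", "U1"}.
    issue_width:  maximum instructions issued per cycle (assumed).
    """
    units = sorted(set().union(*unit_options))
    # The total instruction count may be the limiting factor.
    bound = ceil(len(unit_options) / issue_width)
    # Any group of units must absorb every instruction confined to it.
    for size in range(1, len(units) + 1):
        for group in combinations(units, size):
            allowed = set(group)
            confined = sum(1 for opts in unit_options if opts <= allowed)
            bound = max(bound, ceil(confined / size))
    return bound

# Assumed mix: 3 load/store, 1 multiply on U0 only, 3 adds on U0/U1,
# and the loop branch on U1 only.
example = [{"LS"}] * 3 + [{"U0"}] + [{"U0", "U1"}] * 3 + [{"U1"}]
bound = min_loop_cycles(example, 4)
```

With this assumed mix the load/store unit is the limiting factor, so
removing one memory operation would be the tweak to look for.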
Suppose the loop time is N cycles. Make N issue buckets, with the
final loop branch at the end of the last. Now fill the buckets with
dummy instructions using the functional units desired. Run this to
make sure the intended speed is reached.
Now replace the dummy instructions with the real instructions from
the slow but correct loop you started with. The first will typically
be a load instruction. Then the instruction using that value is placed
in a bucket an appropriate distance down. Run the loop again, to check
it still runs at target speed.
Keep placing instructions, frequently measuring the loop. After a
few you will need to wrap around from the last bucket back to the top
of the loop. If you used the new-register-for-new-value strategy above
then there will be no register conflicts. If not, take care not to
clobber something already in use. Changing registers at this time is
very error prone.
The loop will overlap two or more of the original loop iterations,
and the computation of one vector element result will be started in one
iteration of the new loop, and completed one or several iterations
later.
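The wrap-around and the resulting overlap can be modelled roughly as
follows. This is a sketch under stated assumptions: the function name, the
four-instruction chain and its latencies are invented for illustration, and
functional-unit conflicts are ignored (in practice the dummy-filled buckets
already fix which unit each slot uses).

```python
def place_in_buckets(ops, loop_cycles):
    """Assign each op a (bucket, iterations_later) pair (hypothetical helper).

    ops: list of (name, producer_name_or_None, producer_latency), given
         in dependency order.
    """
    start = {}
    placement = {}
    for name, producer, latency in ops:
        # Earliest cycle at which the producer's value is available.
        cycle = 0 if producer is None else start[producer] + latency
        start[name] = cycle
        # Wrapping past the last bucket lands in a later loop iteration.
        placement[name] = (cycle % loop_cycles, cycle // loop_cycles)
    return placement

# Assumed latencies: load 3 cycles, multiply 4 cycles, add 1 cycle.
ops = [
    ("load", None, 0),
    ("mul", "load", 3),
    ("add", "mul", 4),
    ("store", "add", 1),
]
placement = place_in_buckets(ops, 3)
```

In this made-up case the store for an element sits in the last bucket two
iterations after its load, so three original iterations are in flight at
once.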
The final step is to create feed-in and wind-down code for the loop.
A good way to do this is to make a copy (or copies) of the loop at the
start and delete those instructions which don't have valid antecedents,
and at the end replicate and delete those whose results are unwanted
(including any further loads).
The loop will have a minimum number of limbs loaded and processed,
so the feed-in code must test if the request size is smaller and skip
either to a suitable part of the wind-down or to special code for small
sizes.
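The shape of feed-in, steady-state loop and wind-down can be seen in a
high-level model. The following is a sketch only, written in Python rather
than assembler; the function names, the 64-bit limb size and the
multiply-by-one-limb operation are assumptions chosen for illustration, with
the load placed one iteration ahead of its use.

```python
LIMB_BITS = 64                      # assumed limb size
MASK = (1 << LIMB_BITS) - 1

def mul_1_reference(src, multiplier):
    """Slow but correct loop: returns (result limbs, carry-out limb)."""
    out, carry = [], 0
    for limb in src:
        product = limb * multiplier + carry
        out.append(product & MASK)
        carry = product >> LIMB_BITS
    return out, carry

def mul_1_pipelined(src, multiplier):
    """Same result, with each load issued one iteration early."""
    n = len(src)
    out, carry = [], 0
    if n == 0:
        # Smaller than the pipeline minimum: skip to special code.
        return out, carry
    # Feed-in: a copy of the loop keeping only the load, since the
    # multiply has no valid antecedent yet.
    loaded = src[0]
    # Steady state: finishes element i-1 while loading element i.
    for i in range(1, n):
        current = loaded
        loaded = src[i]
        product = current * multiplier + carry
        out.append(product & MASK)
        carry = product >> LIMB_BITS
    # Wind-down: a copy of the loop with the further load deleted.
    product = loaded * multiplier + carry
    out.append(product & MASK)
    carry = product >> LIMB_BITS
    return out, carry

sample = [MASK, 1, 1 << 63]
result = mul_1_pipelined(sample, 3)
```

Comparing the pipelined version against the reference loop for sizes
around the pipeline minimum is a cheap way to catch feed-in and wind-down
mistakes.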