(gmp.info.gz) Single Limb Division

 
 Single Limb Division
 --------------------
 
 Nx1 division is implemented using repeated 2x1 divisions from high to
 low, either with a hardware divide instruction or a multiplication by
 inverse, whichever is best on a given CPU.
 
    The multiply by inverse follows section 8 of "Division by Invariant
 References::) and is implemented as `udiv_qrnnd_preinv' in
 `gmp-impl.h'.  The idea is to have a fixed-point approximation to 1/d
 (see `invert_limb') and then multiply by the high limb (plus one bit)
 of the dividend to get a quotient q.  With d normalized (high bit set),
 q is no more than 1 too small.  Subtracting q*d from the dividend gives
 a remainder, and reveals whether q or q-1 is correct.
 
    The result is a division done with two multiplications and four or
 five arithmetic operations.  On CPUs with low latency multipliers this
 can be much faster than a hardware divide, though the cost of
 calculating the inverse at the start may mean it's only better on
 inputs bigger than say 4 or 5 limbs.
 
    When a divisor must be normalized, either for the generic C
 `__udiv_qrnnd_c' or the multiply by inverse, the division performed is
 actually a*2^k by d*2^k where a is the dividend and k is the power
 necessary to have the high bit of d*2^k set.  The bit shifts for the
 dividend are usually accomplished "on the fly" meaning by extracting
 the appropriate bits at each step.  Done this way the quotient limbs
 come out aligned ready to store.  When only the remainder is wanted, an
 alternative is to take the dividend limbs unshifted and calculate r = a
 mod d*2^k followed by an extra final step r*2^k mod d*2^k.  This can
 help on CPUs with poor bit shifts or few registers.
 
    The multiply by inverse can be done two limbs at a time.  The
 calculation is basically the same, but the inverse is two limbs and the
 divisor treated as if padded with a low zero limb.  This means more
 work, since the inverse will need a 2x2 multiply, but the four 1x1s to
 do that are independent and can therefore be done partly or wholly in
 parallel.  Likewise for a 2x1 calculating q*d.  The net effect is to
 process two limbs with roughly the same two multiplies worth of latency
 that one limb at a time gives.  This extends to 3 or 4 limbs at a time,
 though the extra work to apply the inverse will almost certainly soon
 reach the limits of multiplier throughput.
 
    A similar approach in reverse can be taken to process just half a
 limb at a time if the divisor is only a half limb.  In this case the
 1x1 multiply for the inverse effectively becomes two (1/2)x1 for each
 limb, which can be a saving on CPUs with a fast half limb multiply, or
 in fact if the only multiply is a half limb, and especially if it's not
 pipelined.
Info Catalog
(gmp.info.gz) Division Algorithms
(gmp.info.gz) Basecase Division
automatically generated byinfo2html