Finally, have you tried adapting my changes back to single-precision
with 'fsels'? On processors where single-precision fp is faster than
double, you should get a measurable win out of this. (G5 appears not
to be, alas...) The magic bit-twiddling might also run a hair faster
in single-precision.
IIRC, double precision math being as fast as single precision math is
one of the improvements that came with the G4 class, so this would be
a G3-specific optimization in the end.
The individual operations might be as fast, but you need to do more of
them to get the extra precision for doubles.
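
For anyone following along without the patch in front of them, here is a
rough sketch of the kind of branch-free select being discussed, in both
widths. It is plain portable C, not the actual changes from this thread: the
helper names (select_ge0_double, select_ge0_float, clamp_float) are made up
for illustration, and whether a given compiler turns the ternary into a
single 'fsel' on PowerPC is an assumption you would have to check in the
generated assembly.

```c
/* Hypothetical helpers, not taken from the patch under discussion.
 * Both mirror the semantics of the PowerPC fsel instruction:
 * the result is if_ge when test >= 0.0, otherwise if_lt.
 * A NaN test value falls through to if_lt, and -0.0 counts as >= 0. */
static double select_ge0_double(double test, double if_ge, double if_lt)
{
    return (test >= 0.0) ? if_ge : if_lt;
}

/* Same select, kept entirely in single precision. */
static float select_ge0_float(float test, float if_ge, float if_lt)
{
    return (test >= 0.0f) ? if_ge : if_lt;
}

/* Example use: clamp x to [lo, hi] with two selects and no explicit
 * if/else in the source. */
static float clamp_float(float x, float lo, float hi)
{
    x = select_ge0_float(x - lo, x, lo);  /* x < lo  -> lo */
    x = select_ge0_float(hi - x, x, hi);  /* x > hi  -> hi */
    return x;
}
```

Timing clamp_float against a double version of the same routine on a G3, G4
and G5 would be one way to settle whether the single-precision variant
actually buys anything.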