I've pasted some improved code below, hopefully wil get you the 2x
speedup you're looking for.
<snip>
I successfully compiled the new function atan2_ (thanks, Ben),
replacing the 'address' operator by (intptr_t)& (thanks, Ian).
With gcc optimization -O0 through -O3, atan2_ works correctly with
very small error. However when I tried it with gcc's -fast option,
the maximum error was huge: 0.57 radians!
A little detective work showed that -fstrict-aliasing (a component of
-fast) is the culprit. Evidently the type-punning involving ypbuf
needs to be more robustly specified, for example by making ypbuf a
union
with appropriate obvious changes to the few code lines that reference
ypbuf. In this way the code is easily immunized against the
depredations of -fast.
Comparison of the corrected atan2_ with the official atan2:
atan2_ |error| = 4.440892e-16
atan2 elapse : 0.87 s
atan2_ elapse : 0.64 s
So, splendid accuracy but not yet 2 x faster than atan2 :-(
I don't need anything like full double, or even single, precision.
An absolute error of 1e-4 would be acceptable.
The search for a suitable fast atan2 continues, all candidates so far
having been either too inaccurate, or not fast enough.
Robert P.
Robert,
First of all, you can trivially remove one (and perhaps two) iterations
of the initial inverse square root calculation, depending on your
precision needs. This alone might get you to the factor of 2.
The conversion to 'fsel' mostly improves the case of random input, of
course... are you testing this with randomized or ordered input? (and
what would your final application be?)
Finally, have you tried adapting my changes back to single-precision
with 'fsels'? On processors where single-precision fp is faster than
double, you should get a measurable win out of this. (G5 appears not to
be, alas...) The magic bit-twiddling might also run a hair faster in
single-precision.