I've pasted some improved code below, hopefully wil get you the 2x
speedup you're looking for.
<snip>
I successfully compiled the new function atan2_ (thanks, Ben),
replacing the 'address' operator by (intptr_t)& (thanks, Ian).
With gcc optimization -O0 through -O3, atan2_ works correctly with very
small error. However when I tried it with gcc's -fast option, the
maximum error was huge: 0.57 radians!
A little detective work showed that -fstrict-aliasing (a component of
-fast) is the culprit. Evidently the type-punning involving ypbuf needs
to be more robustly specified, for example by making ypbuf a union
with appropriate obvious changes to the few code lines that reference
ypbuf. In this way the code is easily immunized against the
depredations of -fast.
Comparison of the corrected atan2_ with the official atan2:
atan2_ |error| = 4.440892e-16
atan2 elapse : 0.87 s
atan2_ elapse : 0.64 s
So, splendid accuracy but not yet 2 x faster than atan2 :-(
I don't need anything like full double, or even single, precision. An
absolute error of 1e-4 would be acceptable.
The search for a suitable fast atan2 continues, all candidates so far
having been either too inaccurate, or not fast enough.