Has something changed? I recently profiled some G3 code, and now
Shark 4.3.2 reports that __fres() has a 14-cycle latency and/or is
not pipelined! Quite obviously GCC 3.3 schedules its generated code as
if this instruction had a 5-cycle latency, the same as
__frsqrtes(), __fnmsubs(), etc. Can I get some clarification
regarding ppc, ppc7400, ppc970 behavior of this intrinsic? I
understand its accuracy, but now certain G3 versions of my vec
functions appear to have huge bubbles. :\
As far as I know, fres has never been pipelined. frsqrte and vrefp
have been pipelined and have latency similar to multiply. You can
use frsqrte to do a pipelined divide with a bit of ingenuity (for
example, by squaring the estimate of 1/sqrt(x)), but since it doesn't
accept negative arguments, the sign has to be handled separately. I
wouldn't make any assumptions about how GCC 3.3 schedules code around
any asm in ppc_intrin.h. In my experience, the compiler presumes each
asm has a one-cycle latency and can make some pretty astounding
scheduling choices.
At times, we were forced to write large segments as asms to defeat
bad scheduling. Perhaps there are improvements since then that I am
not aware of. My workflow switched over to GCC 3.5 then GCC 4 pretty
early on to support Intel and ppc64. GCC 3.3 continued on for a while
after that.
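The frsqrte-based divide mentioned above can be sketched roughly as
follows. This is only an illustration in portable C, not code from
ppc_intrin.h. rsqrt_estimate() is a hypothetical stand-in for the
hardware estimate (it uses the well-known integer bit trick, good to a
few percent, rather than the real instruction), and recip_via_rsqrt()
and its steps parameter are names made up for this sketch. It assumes
a nonzero, finite, normal input and does nothing about zero, infinity,
NaN, or denormals.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for frsqrte: a rough estimate of 1/sqrt(x) for
   positive, normal x. Uses the classic integer bit trick, NOT the real
   instruction; its error is a few percent (roughly 5 bits). */
static float rsqrt_estimate(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);     /* reinterpret the float as an integer */
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Approximate 1/x for nonzero finite x without issuing a divide. */
static float recip_via_rsqrt(float x, int steps)
{
    float ax = (x < 0.0f) ? -x : x;     /* the estimate needs a positive input */
    float r  = rsqrt_estimate(ax);      /* ~ 1/sqrt(|x|) */
    float y  = (x < 0.0f) ? -(r * r) : (r * r);  /* ~ 1/x, sign restored */
    int i;

    for (i = 0; i < steps; ++i) {
        float e = 1.0f - x * y;         /* residual; has the shape of fnmsubs */
        y = y + y * e;                  /* Newton-Raphson step; a multiply-add */
    }
    return y;
}
```

Squaring the estimate roughly doubles its relative error, so this
route starts from a worse guess than a direct reciprocal estimate
would and may need an extra refinement step. For example,
recip_via_rsqrt(3.0f, 3) should agree with 1.0f/3.0f to about single
precision.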
Some G3s (the later ones from IBM but, as I understand it, not all
G3s) deliver better than the required accuracy for fres and frsqrte --
about 12 bits, versus the 8 and 5 bits, respectively, that they are
required to have. I am not aware of an available test to determine
which type of G3 you have. The danger, of course, is that you will
optimize your code to work correctly on the 12-bit flavor and return
insufficiently refined results on older G3s.
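To put rough numbers on that danger, here is a small sketch of how the
starting accuracy drives the number of refinement steps. It assumes the
usual idealization that each Newton-Raphson step doubles the number of
accurate bits and ignores rounding, so treat the results as estimates
rather than guarantees. The function names are made up for this
illustration, and estimate_bits must be positive.

```c
/* Bits of accuracy after `steps` Newton-Raphson refinements, assuming
   each step doubles the accurate bits (an idealization that ignores
   rounding error). */
static int bits_after(int estimate_bits, int steps)
{
    int bits = estimate_bits;
    int i;

    for (i = 0; i < steps; ++i)
        bits *= 2;
    return bits;
}

/* Smallest number of refinement steps that reaches target_bits.
   Requires estimate_bits > 0, otherwise the loop never terminates. */
static int refinement_steps(int estimate_bits, int target_bits)
{
    int steps = 0;

    while (bits_after(estimate_bits, steps) < target_bits)
        ++steps;
    return steps;
}
```

Under that assumption, a 24-bit single-precision result takes one step
from a 12-bit estimate, two steps from an 8-bit fres, and three steps
from a 5-bit frsqrte. Code tuned to a single step on the 12-bit parts
would get only about 16 bits from an 8-bit estimate.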