[...] Is my interpretation of what is occurring correct, or is the
performance difference due to something entirely different? If my
interpretation is correct, are there compiler directives or procedures
that can be used to increase the floating-point throughput on the G5
via AltiVec? That is, without going through and hand-vectorizing all
of the various routines that are slow.
Isn't automatic vectorization of the code a promised feature of Tiger
(at least for C)? We were told this in a public seminar.
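
For reference, a minimal sketch of the kind of loop an auto-vectorizer
can usually handle without hand-written AltiVec code. The function name
saxpy and the build flags are assumptions for illustration, not
something from this thread: with GCC 4.0 or later the relevant options
should be along the lines of `-O2 -ftree-vectorize -maltivec`, but
check what your toolchain actually accepts. Note also that AltiVec only
has single-precision floating-point vectors, so double-precision loops
will not benefit.

```c
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i].
 * The restrict qualifiers promise the compiler that x and y do not
 * overlap, which is often what lets it vectorize the loop.
 * Single precision on purpose: AltiVec has no double-precision vectors. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Simple counted loops over contiguous arrays with no aliasing and no
function calls in the body are the easiest case; anything more
irregular may still need hand vectorization or a library like vDSP.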
One other thing I'll point out, which is what led me to think the
SSE/SSE2 usage was the cause: if I use the FFTs in vDSP for the FFT
portion of the calculation, the G5 is 2x faster than the Opteron. If I
compile the application in 32-bit mode on the Opteron, the G5 FFT ends
up being 4-6x faster.
Any reason not to use vDSP then? Or did I miss something?