[...] Is my interpretation of what is occurring correct, or is the
performance difference due to something entirely different? If it is
the case, are there compiler directives or procedures that can be
used to increase the floating point performance (throughput?) on the
G5 via Altivec? That is, without going through and hand vectorizing
all of the various routines that are slow.
Isn't automatic vectorization of the code a promised feature of Tiger
(at least for C)? We were told this in a public seminar.
Yes, that is an ongoing project for GCC-4.0. You should probably ask on
the gcc list what they expect to deliver in the near term. Please
adjust expectations appropriately, however. An autovectorizer can't
autovectorize well if you have a vector-hostile data layout or present
it with a pile of aliasing problems it can't figure out. In addition,
since AltiVec hardware doesn't do double precision, an autovectorizer
targeting AltiVec won't vectorize double precision code either.
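To make the aliasing point concrete, here is a minimal sketch in C99.
The function name scale_add is hypothetical, invented for illustration.
The restrict qualifiers promise the compiler that dst never overlaps a
or b, which is the kind of guarantee a vectorizer needs before it can
reorder loads and stores. The arrays are single precision and
contiguous, which suits AltiVec. Whether a given compiler actually
vectorizes this loop is something to confirm in the generated assembly.

```c
#include <stddef.h>

/* Hypothetical example. Without restrict the compiler has to assume that
   a store to dst[i] might change a later a[j] or b[j], which forces it to
   keep the loads and stores in strict scalar order. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * k + b[i];   /* unit stride, no cross-iteration dependency */
}
```

Calling it with overlapping buffers would break the promise made by
restrict, so that is only appropriate when the caller can guarantee
the arrays are distinct.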
It's possible that an autovectorizing compiler algorithm might be
applied to a double precision scalar FPU to get some possibly
significant speed wins due to the way caches and pipelines work. (The
G5's dual FPUs can at one level be approximated as a single unpipelined
768-bit double precision vector unit.) At least this is what we see
when we back-port our hand-tuned vector algorithms to the scalar domain
and compare with the performance of the original scalar code. It's not
an obvious thing to do, however, so I don't know whether they are
working on that or not.
I suspect that in the future, using an autovectorizing compiler (well)
will require inspecting the generated code to see what the compiler
did, then refactoring your scalar code to clear up a few details for
the compiler so it can do the optimization that really needs to be
done. As mentioned below, this is already required much of the time to
get good performance out of the current compiler for scalar codegen.
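As an illustration of that kind of refactoring, here is a small sketch
in C. Both function names are hypothetical. In the first version the
running total lives behind a pointer, so a compiler that cannot prove
total does not point into the input array may reload and store it on
every iteration. The second version keeps the sum in a local variable,
which can stay in a register. How much this helps depends on the
compiler and flags, so treat it as something to check rather than a
guaranteed win.

```c
#include <stddef.h>

/* Hypothetical "before": *total may alias in[], so the compiler may have
   to go through memory on every iteration. */
void sum_into_ptr(const double *in, size_t n, double *total)
{
    for (size_t i = 0; i < n; i++)
        *total += in[i];
}

/* Hypothetical "after": accumulate in a local, write back once. */
void sum_into_local(const double *in, size_t n, double *total)
{
    double acc = *total;
    for (size_t i = 0; i < n; i++)
        acc += in[i];
    *total = acc;
}
```

The two versions give the same result whenever total does not point
into in, which is the case the refactoring assumes.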
One other thing I'll point out, which is what made me think it was the
SSE/SSE2 usage: if I use the FFTs in vDSP, the FFT portion of the
calculation is 2x faster than on the Opteron. If I compile the
application in 32-bit mode on the Opteron, the G5 FFT ends up being
4-6x faster.
SSE2, or really any 2-way vector unit, generally isn't as much of a win
as people suspect. You can only fit 2 doubles in a vector, so you are
getting at most a 2x speed win. It's not like the traditional 4, 8 or
16x win people usually expect from a vector unit. On P4/Xeon, only one
double is processed per cycle (their vector unit is internally only 64
bits wide for most stuff), so it's not a win at all for peak throughput
over x87. I'm not sure about the SSE2 bandwidth for Opteron.
A lot of the SSE2 win is just getting away from x87 and its stack-based
register file. 64-bit mode on Opteron also allows you to use more
registers and more modern function calling conventions. That is no
doubt behind some of the speed you are seeing. Neither x87 nor SSE2 has
a fused multiply-add, so it takes twice as many instructions to get
some stuff done as on a G5.
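A polynomial evaluation shows where the instruction count doubles. The
sketch below is plain C with a hypothetical function name, horner. Each
loop step is one multiply feeding one add. On hardware with a fused
multiply-add the compiler can emit a single instruction per step, while
without one it has to issue a multiply followed by an add. Whether the
compiler actually fuses the two depends on its floating point
contraction settings, which is an assumption to check per compiler.

```c
#include <stddef.h>

/* Hypothetical example: evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1)
   with Horner's rule. Returns 0.0 when n is 0. */
double horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    for (size_t i = n; i-- > 0; )
        acc = acc * x + c[i];   /* one multiply-add per coefficient */
    return acc;
}
```

Note that every step depends on the previous one, so this loop is
limited by multiply-add latency rather than throughput.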
On top of that, the G5 has two scalar FPUs. Two scalar FPUs have the
same peak throughput as a hypothetical AltiVec unit that does double
precision, but without the permute overhead. Since you don't have to
program to vector APIs to use them, it's a pretty good deal. However,
keeping all that horsepower busy isn't automatic. There is frequently a
lot to be gained from explicitly exposing 12+ way parallelism in your
code on G5. It's difficult for compilers to do that on their own.
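One common way to expose that parallelism by hand is to split a
reduction across several independent accumulators, so that consecutive
multiply-adds do not wait on each other. The sketch below uses a
hypothetical function, dot4, with four accumulators to keep it short.
The same pattern extends to the 12 or more mentioned above. It changes
the order of the additions, so results can differ from a simple serial
sum in the last bits, which is also why a compiler will not normally
make this change on its own.

```c
#include <stddef.h>

/* Hypothetical example: dot product with four independent partial sums. */
double dot4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: the four chains have no dependency on one another. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Tail: handle the remaining 0 to 3 elements. */
    for (; i < n; i++)
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

The right number of accumulators depends on FPU count and pipeline
depth, so it is worth timing a few variants rather than assuming one.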