The performance varies entirely on the strength of the FFT algorithm
used. For example, looking at the FFTW benchmarks page (which cover a
*very* wide variety of FFTs) it should be clear that with optimum
implementations a G5 can out-perform nearly any other machine but also
that implementations vary more widely than processors.
A 2.0 GHz Opteron peaks out at about 2.5 GFlops, whereas a 2.0 GHz G5
is about 4 GFlops. The Itanium II does very well, but seems to suffer
from frequency issues, at least in that set of test data. Xeon is
competitive. Since vecLib is topping the speed lists here, I should
point out that Eric has done some retuning for Tiger/G5 to nudge these
numbers around a bit, but more on that later. ;-)
That is of course a 1024 FFT. Things are clearly different for larger
problems. You can usually recast larger FFTs into bits that work better
on the processor, but this is of course *work*! :-) For that reason, it
may or may not apply to the FFT near you. As I said, much of this
varies on the strength of the FFT algorithm used. Here is a description
of one such effort to give you an idea of what is involved:
I think the fundamental issue you are experiencing, David, is probably
that FFTs skip around in data a lot. If you are doing a large one, then
you are almost certainly limited by the rate at which the CPU can bring
scattered data in and out of the processor. Less likely, but possibly
at play, there is also going to be a region where some processors are
in cache and others are not, depending on how much cache you have.
These can make profound differences. Looking at the FFTW benchmarks
page again, you can see how calculation throughput drops off by an
order of magnitude or more as you fall out of cache on all processors.
On both these counts the G5 may come up a bit short. While data
throughput is very high on a G5, latencies are long too, several
hundred cycles for a full L2 cache miss trip out to DRAM with various
misadventures along the way. (I don't recall the exact count, but it is
up there.) The Opteron has a integrated memory controller on the CPU.
That cuts out the memory controller "middle man" and *significantly*
cuts data access latencies. If all you are doing is waiting for that
latency, then that will have a profound effect on benchmarks where the
limiting factor is just getting scattered data in and out of the CPU
(and nobody takes the time to issue prefetch hints). This is a
technology that will no doubt find its way into all major processor
families in due time, but for the moment Opteron enjoys a large
competitive edge here as competitors play catch up.
The other issue, cache size, seems less likely to be at play in this
case. The G5 has a 512 kB L2. The Opteron is 1MB -- larger but not
huge. On the other hand, if you had tested something like a Power4, it
might mop up here since it can fit a large number of large problems in
cache (up to 128 MB!) that the other processors can't touch. You don't
need a low latency DRAM access when your problem fits entirely in
cache. Of course, you pay a hefty premium for a feature like that.
Once again, many of the issues are probably solvable in software, but
requires some advanced tuning to recast the problem into something
appropriate for the processor's strengths. Apple has done some research
into tuning for larger FFTs, but I suspect that there is more work
ahead.
So, in short, despite the subject line of this article, this problem
has nothing to do with performance of the FPU. If it did, dual
floating point units with fused multiply adds would likely clean up the
competition. ..or so I say. ;-)
Ian
P.S. If you don't have enough real RAM on your system to hold the
problem set, then you'll be paging to disk. Expect to lose a few more
orders of magnitude in performance if that happens.