Thanks for the thorough response and references. I guess my theory
about floating point units was wrong (wouldn't be the first time...).
If I'm following correctly then it would appear that the initial FFT
calculation is more due to memory access issues/latency and even stride
issues (in this case the FFT is 512x512x256) so there is a power of two
problem as well.
Your analysis is consistent with another observation that, later when
smaller FFT's are calculated in the program (64x128) the performance
discrepancy disappears.
Since the large FFT is only calculated once in the beginning of a run
the overhead is only incurred once. It was just one of those things
that bothered me. That is assuming I'm following you correctly. Does
this sound reasonable?
Also, you mention the use of using prefetch hints. So for example.
When performing the calculation across the row (stride=1) would the G5
automatically enable prefetching (I recall a discussion here about
detection of data streaming or such). But then when the calculation is
performed down the columns, the algorithm first copies a columns worth
of data into a row and then passes that in for the FFT. Is there a
better of way doing this with the scalar implementation? And how would
this compare to what vecLib does for a 2d transform. Presumably at
some point every algorithm is going to have to pull in scattered data
and reorganize it. If prefetching would help, are there code examples
of how one would include them somewhere (I remember looking a while
back on google and the developer site, but don't recall seeing
anything)?
Thanks again for your help,
Dave
On Jan 9, 2005, at 3:49 PM, Ian Ollmann wrote:
The performance varies entirely on the strength of the FFT algorithm
used. For example, looking at the FFTW benchmarks page (which cover a
*very* wide variety of FFTs) it should be clear that with optimum
implementations a G5 can out-perform nearly any other machine but also
that implementations vary more widely than processors.
A 2.0 GHz Opteron peaks out at about 2.5 GFlops, whereas a 2.0 GHz G5
is about 4 GFlops. The Itanium II does very well, but seems to suffer
from frequency issues, at least in that set of test data. Xeon is
competitive. Since vecLib is topping the speed lists here, I should
point out that Eric has done some retuning for Tiger/G5 to nudge these
numbers around a bit, but more on that later. ;-)
That is of course a 1024 FFT. Things are clearly different for larger
problems. You can usually recast larger FFTs into bits that work
better on the processor, but this is of course *work*! :-) For that
reason, it may or may not apply to the FFT near you. As I said, much
of this varies on the strength of the FFT algorithm used. Here is a
description of one such effort to give you an idea of what is
involved:
I think the fundamental issue you are experiencing, David, is probably
that FFTs skip around in data a lot. If you are doing a large one,
then you are almost certainly limited by the rate at which the CPU can
bring scattered data in and out of the processor. Less likely, but
possibly at play, there is also going to be a region where some
processors are in cache and others are not, depending on how much
cache you have. These can make profound differences. Looking at the
FFTW benchmarks page again, you can see how calculation throughput
drops off by an order of magnitude or more as you fall out of cache on
all processors.
On both these counts the G5 may come up a bit short. While data
throughput is very high on a G5, latencies are long too, several
hundred cycles for a full L2 cache miss trip out to DRAM with various
misadventures along the way. (I don't recall the exact count, but it
is up there.) The Opteron has a integrated memory controller on the
CPU. That cuts out the memory controller "middle man" and
*significantly* cuts data access latencies. If all you are doing is
waiting for that latency, then that will have a profound effect on
benchmarks where the limiting factor is just getting scattered data in
and out of the CPU (and nobody takes the time to issue prefetch
hints). This is a technology that will no doubt find its way into all
major processor families in due time, but for the moment Opteron
enjoys a large competitive edge here as competitors play catch up.
The other issue, cache size, seems less likely to be at play in this
case. The G5 has a 512 kB L2. The Opteron is 1MB -- larger but not
huge. On the other hand, if you had tested something like a Power4,
it might mop up here since it can fit a large number of large problems
in cache (up to 128 MB!) that the other processors can't touch. You
don't need a low latency DRAM access when your problem fits entirely
in cache. Of course, you pay a hefty premium for a feature like that.
Once again, many of the issues are probably solvable in software, but
requires some advanced tuning to recast the problem into something
appropriate for the processor's strengths. Apple has done some
research into tuning for larger FFTs, but I suspect that there is more
work ahead.
So, in short, despite the subject line of this article, this problem
has nothing to do with performance of the FPU. If it did, dual
floating point units with fused multiply adds would likely clean up
the competition. ..or so I say. ;-)
Ian
P.S. If you don't have enough real RAM on your system to hold the
problem set, then you'll be paging to disk. Expect to lose a few more
orders of magnitude in performance if that happens.
_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list
(email@hidden)
Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/perfoptimization-dev/
email@hidden
This email sent to email@hidden
David W. Gohara, Ph.D.
Harvard Medical School
http://www.scianafilms.com
617-432-1216 (p)
617-432-4360 (f)