Mailing Lists: Apple Mailing Lists


Re: Floating Point comparison G5 vs. Opteron (64-bit) question




Performance depends almost entirely on the quality of the FFT implementation used. For example, the FFTW benchmarks page (which covers a *very* wide variety of FFTs) makes it clear that with optimum implementations a G5 can out-perform nearly any other machine, but also that implementations vary more widely than processors do.


http://www.fftw.org/speed/
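For anyone reading those charts: as far as I know, the FFTW benchmark pages do not count actual floating point operations. They report a nominal "mflops" figure of 5 N log2(N) divided by the time for one complex transform in microseconds. Here is a minimal sketch of that conversion. The function names and the 12.8 microsecond timing are made up for illustration, not taken from the benchmark data.

```python
import math

def nominal_flops(n_points):
    """Nominal operation count for one complex FFT: 5 * N * log2(N)."""
    return 5.0 * n_points * math.log2(n_points)

def fft_mflops(n_points, seconds_per_transform):
    """Benchmark-style 'mflops': nominal flops per microsecond of run time."""
    microseconds = seconds_per_transform * 1e6
    return nominal_flops(n_points) / microseconds

# Illustrative timing only: a 1024-point transform that takes 12.8 microseconds
# scores roughly 4000 mflops, i.e. about 4 GFlops.
print(round(fft_mflops(1024, 12.8e-6)))
```

Because the operation count is nominal, two implementations with the same score may be doing different amounts of real arithmetic. The figure is best read as a scaled measure of transforms per second.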

A 2.0 GHz Opteron peaks at about 2.5 GFlops, whereas a 2.0 GHz G5 reaches about 4 GFlops. The Itanium II does very well, but seems to suffer from frequency issues, at least in that set of test data. Xeon is competitive. Since vecLib is topping the speed lists here, I should point out that Eric has done some retuning for Tiger/G5 to nudge these numbers around a bit, but more on that later. ;-)

That is of course for a 1024-point FFT. Things are clearly different for larger problems. You can usually recast larger FFTs into pieces that work better on the processor, but this is of course *work*! :-) For that reason, it may or may not apply to the FFT near you. As I said, much of this depends on the quality of the FFT implementation used. Here is a description of one such effort to give you an idea of what is involved:

http://www2.cs.uh.edu/~johnsson/lacsi_reviewtalk_S.pdf
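To make "recasting a larger FFT into pieces" concrete, here is a minimal sketch of the textbook Cooley-Tukey factorization of a length N = n1 * n2 transform into n2 short transforms of length n1, a twiddle multiply, and n1 short transforms of length n2. This is the generic identity only, written with a slow direct DFT so it stays self-contained. It is not the method from the talk linked above or from any particular library, and the names `dft` and `split_dft` are hypothetical.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, used here as the 'short transform' and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def split_dft(x, n1, n2):
    """DFT of length n1*n2 built from short DFTs of length n1 and n2."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Step 1: n2 transforms of length n1, each over a strided slice of the input.
    inner = [dft(x[j2::n2]) for j2 in range(n2)]          # inner[j2][k1]
    # Step 2: twiddle factors that stitch the short transforms together.
    for j2 in range(n2):
        for k1 in range(n1):
            inner[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)
    # Step 3: n1 transforms of length n2, written to the transposed output index.
    out = [0j] * n
    for k1 in range(n1):
        column = dft([inner[j2][k1] for j2 in range(n2)])  # column[k2]
        for k2 in range(n2):
            out[k1 + n1 * k2] = column[k2]
    return out

x = [complex(i % 7, -(i % 3)) for i in range(24)]
error = max(abs(a - b) for a, b in zip(split_dft(x, 4, 6), dft(x)))
print(error < 1e-9)
```

The point of the split is that each short transform can be sized to fit in cache or registers. The price is the strided gathers and the transposed write, and that is where the tuning work goes.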

I think the fundamental issue you are experiencing, David, is probably that FFTs skip around in data a lot. If you are doing a large one, then you are almost certainly limited by the rate at which the CPU can bring scattered data in and out of the processor. Less likely, but possibly at play, there will also be a range of problem sizes that fit in cache on some processors but not on others, depending on how much cache each one has. Both effects can make profound differences. Looking at the FFTW benchmarks page again, you can see how calculation throughput drops off by an order of magnitude or more on all processors as the problem falls out of cache.
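A rough way to see how much an FFT skips around: in a plain in-place radix-2 transform, the two inputs of each butterfly sit 1, 2, 4, ... N/2 elements apart as the stages progress. The sketch below lists those distances in bytes and counts the stages where the two inputs are at least a given span apart. The 8-byte element (single-precision complex) and the 128-byte span, standing in for a cache line, are assumptions for illustration. Real libraries use other radices and reorderings precisely to avoid this pattern.

```python
def butterfly_strides(n_points, bytes_per_element=8):
    """Byte distance between the two inputs of a butterfly at each radix-2 stage."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 2")
    strides = []
    distance = 1
    while distance < n_points:
        strides.append(distance * bytes_per_element)
        distance *= 2
    return strides

def stages_spanning(n_points, span_bytes, bytes_per_element=8):
    """Number of stages whose butterfly inputs are at least span_bytes apart."""
    return sum(1 for stride in butterfly_strides(n_points, bytes_per_element)
               if stride >= span_bytes)

n = 1 << 20                                  # a million-point transform
print(len(butterfly_strides(n)))             # 20 stages in total
print(stages_spanning(n, 128))               # 16 of them reach across >= 128 bytes
```

With these assumed sizes, most of the stages in a large transform touch two widely separated locations per butterfly, so once the data no longer fits in cache each stage is another pass through memory.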

On both these counts the G5 may come up a bit short. While data throughput is very high on a G5, latencies are long too: several hundred cycles for a full L2 cache miss trip out to DRAM, with various misadventures along the way. (I don't recall the exact count, but it is up there.) The Opteron has an integrated memory controller on the CPU. That cuts out the memory controller "middle man" and *significantly* cuts data access latencies. If all you are doing is waiting for that latency, then that will have a profound effect on benchmarks where the limiting factor is just getting scattered data in and out of the CPU (and nobody takes the time to issue prefetch hints). This is a technology that will no doubt find its way into all major processor families in due time, but for the moment Opteron enjoys a large competitive edge here as competitors play catch-up.

http://www.tomshardware.com/cpu/20020424/opteron-05.html
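Some back-of-the-envelope arithmetic shows why latency, rather than peak throughput, ends up setting the pace. If the code waits for each cache miss before issuing the next, the delivered bandwidth is just one cache line per miss latency. If several misses are kept in flight (which is what prefetch hints buy you), it scales up accordingly. The sketch below encodes only that relationship. The 128-byte line, 400-cycle latency and 2 GHz clock are placeholder inputs, not measured figures for any processor.

```python
def latency_bound_bandwidth(line_bytes, latency_cycles, clock_hz, misses_in_flight=1):
    """Bytes per second delivered when progress is gated by cache miss latency."""
    seconds_per_miss = latency_cycles / clock_hz
    return misses_in_flight * line_bytes / seconds_per_miss

# Placeholder numbers: 128-byte lines, 400-cycle miss, 2 GHz clock.
one_at_a_time = latency_bound_bandwidth(128, 400, 2.0e9)
overlapped = latency_bound_bandwidth(128, 400, 2.0e9, misses_in_flight=8)
print(one_at_a_time / 1e6, "MB/s with one miss at a time")
print(overlapped / 1e6, "MB/s with eight misses overlapped")
```

Halving the latency doubles the first number without touching the peak bus bandwidth, which is the kind of advantage an on-chip memory controller gives on scattered access patterns.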

The other issue, cache size, seems less likely to be at play in this case. The G5 has a 512 kB L2. The Opteron has 1 MB -- larger but not huge. On the other hand, if you had tested something like a Power4, it might mop up here, since it can fit a large number of large problems in cache (up to 128 MB!) that the other processors can't touch. You don't need low-latency DRAM access when your problem fits entirely in cache. Of course, you pay a hefty premium for a feature like that.

http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/power4_5.html
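To put rough numbers on the cache sizes mentioned above, the sketch below finds the largest power-of-two transform whose data alone fits in a given cache. It assumes an in-place transform on 8-byte single-precision complex elements and ignores twiddle tables and anything else competing for the cache, so treat the results as upper bounds. The function name is hypothetical.

```python
def largest_pow2_fft_in_cache(cache_bytes, bytes_per_element=8):
    """Largest power-of-two point count whose data fits in cache_bytes."""
    max_elements = cache_bytes // bytes_per_element
    if max_elements < 1:
        return 0
    return 1 << (max_elements.bit_length() - 1)

KIB = 1024
MIB = 1024 * KIB
for label, size in (("512 kB", 512 * KIB), ("1 MB", 1 * MIB), ("128 MB", 128 * MIB)):
    print(label, largest_pow2_fft_in_cache(size))
```

Under those assumptions the 512 kB and 1 MB caches differ by only one power of two in problem size, while a 128 MB cache holds transforms 128 times larger than the 1 MB one.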

Once again, many of these issues are probably solvable in software, but that requires some advanced tuning to recast the problem into something appropriate for the processor's strengths. Apple has done some research into tuning for larger FFTs, but I suspect that there is more work ahead.

http://images.apple.com/acg/pdf/g4fft.pdf
http://images.apple.com/acg/pdf/20040827_GigaFFT.pdf

So, in short, despite the subject line of this article, this problem has nothing to do with the performance of the FPU. If it did, dual floating point units with fused multiply-adds would likely clean up the competition. ...or so I say. ;-)

Ian

P.S. If you don't have enough real RAM on your system to hold the problem set, then you'll be paging to disk. Expect to lose a few more orders of magnitude in performance if that happens.




_______________________________________________ Do not post admin requests to the list. They will be ignored. PerfOptimization-dev mailing list (email@hidden) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/perfoptimization-dev/email@hidden

References: 
 >Floating Point comparison G5 vs. Opteron (64-bit) question (From: David Gohara <email@hidden>)




Copyright © 2007 Apple Inc. All rights reserved.