Mailing Lists: Apple Mailing Lists

Image of Mac OS face in stamp
 
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Floating Point comparison G5 vs. Opteron (64-bit) question



Ian,

Thanks for the thorough response and references. I guess my theory about floating point units was wrong (wouldn't be the first time...).

If I'm following correctly then it would appear that the initial FFT calculation is more due to memory access issues/latency and even stride issues (in this case the FFT is 512x512x256) so there is a power of two problem as well.

Your analysis is consistent with another observation that, later when smaller FFT's are calculated in the program (64x128) the performance discrepancy disappears.

Since the large FFT is only calculated once in the beginning of a run the overhead is only incurred once. It was just one of those things that bothered me. That is assuming I'm following you correctly. Does this sound reasonable?

Also, you mention the use of using prefetch hints. So for example. When performing the calculation across the row (stride=1) would the G5 automatically enable prefetching (I recall a discussion here about detection of data streaming or such). But then when the calculation is performed down the columns, the algorithm first copies a columns worth of data into a row and then passes that in for the FFT. Is there a better of way doing this with the scalar implementation? And how would this compare to what vecLib does for a 2d transform. Presumably at some point every algorithm is going to have to pull in scattered data and reorganize it. If prefetching would help, are there code examples of how one would include them somewhere (I remember looking a while back on google and the developer site, but don't recall seeing anything)?

  Thanks again for your help,

Dave


On Jan 9, 2005, at 3:49 PM, Ian Ollmann wrote:


The performance varies entirely on the strength of the FFT algorithm used. For example, looking at the FFTW benchmarks page (which cover a *very* wide variety of FFTs) it should be clear that with optimum implementations a G5 can out-perform nearly any other machine but also that implementations vary more widely than processors.


http://www.fftw.org/speed/

A 2.0 GHz Opteron peaks out at about 2.5 GFlops, whereas a 2.0 GHz G5 is about 4 GFlops. The Itanium II does very well, but seems to suffer from frequency issues, at least in that set of test data. Xeon is competitive. Since vecLib is topping the speed lists here, I should point out that Eric has done some retuning for Tiger/G5 to nudge these numbers around a bit, but more on that later. ;-)

That is of course a 1024 FFT. Things are clearly different for larger problems. You can usually recast larger FFTs into bits that work better on the processor, but this is of course *work*! :-) For that reason, it may or may not apply to the FFT near you. As I said, much of this varies on the strength of the FFT algorithm used. Here is a description of one such effort to give you an idea of what is involved:

http://www2.cs.uh.edu/~johnsson/lacsi_reviewtalk_S.pdf

I think the fundamental issue you are experiencing, David, is probably that FFTs skip around in data a lot. If you are doing a large one, then you are almost certainly limited by the rate at which the CPU can bring scattered data in and out of the processor. Less likely, but possibly at play, there is also going to be a region where some processors are in cache and others are not, depending on how much cache you have. These can make profound differences. Looking at the FFTW benchmarks page again, you can see how calculation throughput drops off by an order of magnitude or more as you fall out of cache on all processors.

On both these counts the G5 may come up a bit short. While data throughput is very high on a G5, latencies are long too, several hundred cycles for a full L2 cache miss trip out to DRAM with various misadventures along the way. (I don't recall the exact count, but it is up there.) The Opteron has a integrated memory controller on the CPU. That cuts out the memory controller "middle man" and *significantly* cuts data access latencies. If all you are doing is waiting for that latency, then that will have a profound effect on benchmarks where the limiting factor is just getting scattered data in and out of the CPU (and nobody takes the time to issue prefetch hints). This is a technology that will no doubt find its way into all major processor families in due time, but for the moment Opteron enjoys a large competitive edge here as competitors play catch up.

http://www.tomshardware.com/cpu/20020424/opteron-05.html

The other issue, cache size, seems less likely to be at play in this case. The G5 has a 512 kB L2. The Opteron is 1MB -- larger but not huge. On the other hand, if you had tested something like a Power4, it might mop up here since it can fit a large number of large problems in cache (up to 128 MB!) that the other processors can't touch. You don't need a low latency DRAM access when your problem fits entirely in cache. Of course, you pay a hefty premium for a feature like that.

http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/ power4_5.html

Once again, many of the issues are probably solvable in software, but requires some advanced tuning to recast the problem into something appropriate for the processor's strengths. Apple has done some research into tuning for larger FFTs, but I suspect that there is more work ahead.

http://images.apple.com/acg/pdf/g4fft.pdf
http://images.apple.com/acg/pdf/20040827_GigaFFT.pdf

So, in short, despite the subject line of this article, this problem has nothing to do with performance of the FPU. If it did, dual floating point units with fused multiply adds would likely clean up the competition. ..or so I say. ;-)

Ian

P.S. If you don't have enough real RAM on your system to hold the problem set, then you'll be paging to disk. Expect to lose a few more orders of magnitude in performance if that happens.




_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/perfoptimization-dev/ email@hidden


This email sent to email@hidden


David W. Gohara, Ph.D.
Harvard Medical School
http://www.scianafilms.com
617-432-1216 (p)
617-432-4360 (f)

_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/perfoptimization-dev/email@hidden

This email sent to email@hidden
References: 
 >Floating Point comparison G5 vs. Opteron (64-bit) question (From: David Gohara <email@hidden>)
 >Re: Floating Point comparison G5 vs. Opteron (64-bit) question (From: Ian Ollmann <email@hidden>)



Visit the Apple Store online or at retail locations.
1-800-MY-APPLE

Contact Apple | Terms of Use | Privacy Policy

Copyright © 2007 Apple Inc. All rights reserved.