Re: Floating Point comparison G5 vs. Opteron (64-bit) question




On Jan 9, 2005, at 1:55 PM, David Gohara wrote:


Also, you mention using prefetch hints. For example, when performing the calculation across the row (stride=1), would the G5 automatically enable prefetching? (I recall a discussion here about detection of data streaming or such.) When the calculation is performed down the columns, the algorithm first copies a column's worth of data into a row and then passes that in for the FFT. Is there a better way of doing this with the scalar implementation? And how would this compare to what vecLib does for a 2D transform? Presumably at some point every algorithm is going to have to pull in scattered data and reorganize it. If prefetching would help, are there code examples somewhere of how one would include the hints? (I remember looking a while back on Google and the developer site, but don't recall seeing anything.)

The G5 has an automatic prefetch engine to prefetch data ahead of time. In essence, designing an automatic prefetch engine is asking hardware designers to predict the future (even if it is just 1 microsecond away), so there is a limit to what they can do. Generally speaking, the engine spots data access patterns and then extrapolates from there to what you are likely to do. The two patterns it picks up are in-order ascending and descending cachelines. In short, if you move linearly forward or backward through memory, the G5 will probably issue a prefetch stream automatically and stream in the data for you.


If you are skipping around, then you might be out of luck. There are 4 hardware prefetch engines available to stream in data concurrently (there are actually 8, but by default 4 of them are configured to service dst). If you skip around inside four otherwise linear streams, things might still work for you. If you are skipping around 512 independent linear streams (like an FFT data column in C storage order), then you are probably not well served by the automatic hardware prefetcher.

Please see section 3.6.4.3 of the PowerPC 970 user manual for a description of how the automatic prefetch engine works.

http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/AE818B5D1DBB02EC87256DDE00007821/$file/970FX_user_manual_v1.41.pdf

How exactly is your data stored? Are you working in F77? Is this a row in a 1D array or a row in a 2D array that you are copying into?

You can issue your own prefetch hints. The way to do that is to issue a dcbt instruction (typically using a C intrinsic such as __builtin_prefetch()) a large number of cycles before you need the data.

http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

I don't know what, if any, facility exists in F77 for this.

"Large" here is implementation dependent, but let's say at least 100 cycles, which is something like 4-8 loop iterations ahead in most code, but might be a lot more if your loop is small. Experiment with different lead times to find the best one. If you get it wrong (or the data is already in cache, or the hardware prefetcher already got it), you will see no speed improvement or a slight speed loss. If you get it right, your FSB-bottlenecked routine will probably go 30% faster or so, though up to 4x is possible in very rare cases.

Be aware that there is a limit to how fast the front side bus can go, and it is a lot slower than how fast the FPU can go. Prefetching doesn't make the FSB go faster; it just gives it something to do when there isn't an immediate demand load to service, so you get to use more of it more of the time. In short, it helps, but unless your calculation has a very high ratio of FPU work to LSU work (FFT isn't that way), prefetching alone isn't going to get you close to saturating the FPU.

The CPU can only hold eight outstanding cache misses in the load miss queue. You can't issue 100 dcbt's all at once and expect them all to get serviced. AFAIK the ones that don't fit simply get ignored.

My dim recollection is that vecLib breaks down 2D transformations into one big 1D transformation spanning all the data, but there have been a lot of changes lately. Eric Postpischil can better answer that question.

Configure Shark to sample on L2 cache misses. That should show you which parts of which routines need prefetch hints. Don't bother with functions that don't also take up appreciable CPU time.

Ian


PerfOptimization-dev mailing list
References: 
 >Floating Point comparison G5 vs. Opteron (64-bit) question (From: David Gohara <email@hidden>)
 >Re: Floating Point comparison G5 vs. Opteron (64-bit) question (From: Ian Ollmann <email@hidden>)
 >Re: Floating Point comparison G5 vs. Opteron (64-bit) question (From: David Gohara <email@hidden>)




Copyright © 2007 Apple Inc. All rights reserved.