Mailing Lists: Apple Mailing Lists


Re: Floating Point comparison G5 vs. Opteron (64-bit) question




On Jan 9, 2005, at 2:59 PM, Marco Scheurer wrote:

> On Jan 9, 2005, at 20:33, David Gohara wrote:
>
>> [...] Is my interpretation of what is occurring correct, or is the performance difference due to something entirely different? If it is the case, are there compiler directives or procedures that can be used to increase the floating point performance (throughput?) on the G5 via Altivec? That is, without going through and hand vectorizing all of the various routines that are slow.
>
> Isn't automatic vectorization of the code a promised feature of Tiger (at least for C)? We were told this in a public seminar.

Yes, that is an ongoing project for GCC 4.0. You should probably ask on the gcc list what they expect to deliver in the near term. Please adjust expectations appropriately, however. An autovectorizer can't vectorize well if you give it a vector-hostile data layout or a pile of aliasing problems it can't resolve. In addition, since AltiVec hardware doesn't do double precision, an autovectorizer targeting AltiVec would not vectorize double-precision code either.
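To make the aliasing and data-layout points concrete, here is a minimal sketch in C99. The names (saxpy, PointsSoA, scale_x) are made up for illustration and do not come from the original poster's code. The restrict qualifier promises the compiler that the two arrays never overlap, and the structure-of-arrays layout keeps each field contiguous in memory. Whether a given compiler actually vectorizes these loops is not guaranteed; they only remove two common obstacles.

```c
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] in single precision
 * (AltiVec has no double-precision vector arithmetic).
 * 'restrict' tells the compiler x and y do not alias, so it does not
 * have to assume that a store to y[i] could change a later x[j]. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Vector-friendly layout: one contiguous array per field
 * (structure of arrays) instead of an array of {x, y, z} structs,
 * where the fields would be interleaved in memory. */
typedef struct {
    float *x;
    float *y;
    float *z;
} PointsSoA;

/* Scales only the x coordinates; the loop walks unit-stride memory. */
void scale_x(size_t n, PointsSoA p, float s)
{
    float *restrict px = p.x;
    for (size_t i = 0; i < n; ++i)
        px[i] *= s;
}
```

Calling saxpy(4, 2.0f, x, y) with x = {1, 2, 3, 4} and y = {10, 20, 30, 40} leaves y = {12, 24, 36, 48}. The caller is responsible for honoring the restrict promise; passing overlapping arrays is undefined behavior.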


It's possible that an autovectorizing compiler algorithm might be applied to a double precision scalar FPU to get some possibly significant speed wins due to the way caches and pipelines work. (The G5's dual FPUs can at one level be approximated as a single unpipelined 768-bit double precision vector unit, i.e. twelve 64-bit doubles wide.) At least this is what we see when we backport our hand-tuned vector algorithms to the scalar domain and compare with the performance of the original scalar code. It's not an obvious thing to do, however, so I don't know whether they are working on that or not.

I suspect that in the future, using an autovectorizing compiler well will require inspecting the generated code to see what the compiler did, then refactoring your scalar code to clear up a few details so the compiler can do the optimization that really needs to be done. As mentioned below, this is already required much of the time to get good scalar codegen out of the current compiler.

>> One other thing that I'll point out that caused me to think it was the SSE/SSE2 usage: if I use the FFTs in vDSP, that portion of the calculation is 2x faster on the G5 than on the Opteron. If I compile the application in 32-bit mode on the Opteron, the G5 FFT ends up being 4-6x faster.

SSE2, or really any 2-way vector unit, generally isn't as much of a win as people suspect. You can fit only 2 doubles in a 128-bit vector, so you are getting at most a 2x speed win. It's not like the traditional 4, 8 or 16x win people usually expect from a vector unit. On P4/Xeon, only one double is processed per cycle (their vector unit is internally only 64 bits wide for most stuff), so it's not a win at all for peak throughput over x87. I'm not sure about the SSE2 bandwidth for Opteron. A lot of the SSE2 win is just getting away from x87 and its stack-based register file. 64-bit mode on Opteron also allows you to use more registers and more modern function calling conventions. That is no doubt behind some of the speed you are seeing. Neither x87 nor SSE2 has a fused multiply-add, so it takes twice as many instructions to get some stuff done as on a G5.
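As an illustration of the fused multiply-add point, here is a small sketch of Horner's rule for polynomial evaluation, where every step has the form acc * x + c. The function name horner is made up for this example. On a machine with a fused multiply-add instruction (fmadd on PowerPC), a compiler may turn each step into one instruction; without one, each step needs a separate multiply and add. Whether the fusion actually happens depends on the compiler and its flags.

```c
#include <stddef.h>

/* Hypothetical example: evaluates
 *   c[0] + c[1]*x + c[2]*x^2 + ... + c[n-1]*x^(n-1)
 * using Horner's rule. Each iteration is a single multiply-add,
 * which is the pattern a fused multiply-add unit executes in one
 * instruction. Returns 0.0 when n == 0. */
double horner(const double *c, size_t n, double x)
{
    double acc = 0.0;
    for (size_t i = n; i-- > 0; )
        acc = acc * x + c[i];   /* candidate for one fmadd per step */
    return acc;
}
```

For c = {1, 2, 3} and x = 2 this computes 1 + 2*2 + 3*4 = 17. Note that a fused operation rounds once instead of twice, so results can differ in the last bit from a separate multiply and add when the intermediate product is not exactly representable.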


On top of that, the G5 has two scalar FPUs. Two scalar FPUs have the same peak throughput as a hypothetical AltiVec unit that does double precision, but without the permute overhead. Since you don't have to program to vector APIs to use them, it's a pretty good deal. However, keeping all that horsepower busy isn't automatic. There is frequently a lot to be gained from explicitly exposing 12+ way parallelism in your code on G5. It's difficult for compilers to do that on their own.
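One common way to expose that kind of parallelism in scalar code is to split a reduction across several independent accumulators, so that consecutive floating-point operations do not wait on each other. The sketch below is only an illustration: the name dot_multi and the LANES constant are invented here, and the choice of 12 assumes two FPUs with roughly six operations in flight each, which matches the 768-bit (12 x 64-bit) figure above but should be tuned by measurement.

```c
#include <stddef.h>

/* Assumed number of independent dependency chains (see lead-in). */
enum { LANES = 12 };

/* Hypothetical example: dot product of a and b using LANES separate
 * partial sums. A single accumulator would serialize every add behind
 * the previous one; here each acc[k] forms its own chain, so the FPU
 * pipelines can overlap them. */
double dot_multi(const double *a, const double *b, size_t n)
{
    double acc[LANES] = {0.0};
    size_t i = 0;

    /* Main loop: LANES independent multiply-adds per iteration. */
    for (; i + LANES <= n; i += LANES)
        for (size_t k = 0; k < LANES; ++k)
            acc[k] += a[i + k] * b[i + k];

    /* Combine the partial sums. */
    double sum = 0.0;
    for (size_t k = 0; k < LANES; ++k)
        sum += acc[k];

    /* Handle the leftover elements when n is not a multiple of LANES. */
    for (; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

Because floating-point addition is not associative, this version can round differently from a single-accumulator loop. That is one reason a compiler will not normally make this transformation for you unless you relax its floating-point rules. It is also worth inspecting the generated code to confirm the partial sums stay in registers.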

Ian
_______________________________________________
PerfOptimization-dev mailing list      (email@hidden)
References: 
 >Floating Point comparison G5 vs. Opteron (64-bit) question (From: David Gohara <email@hidden>)
 >Re: Floating Point comparison G5 vs. Opteron (64-bit) question (From: Marco Scheurer <email@hidden>)


