
Re: XCode 2.2.1 / gcc 4.0 Peephole Bug




On Apr 19, 2006, at 1:27 PM, Shaun Wexler wrote:

On Apr 19, 2006, at 1:11 AM, Ben Weiss wrote:

Given the AltiVec function:

vector unsigned short peepholebug(vector unsigned short a, vector unsigned short b) {
    vector unsigned short mask = (vector unsigned short)vec_cmplt(a, b);

    if (vec_all_ge(a, b)) return a;

    return mask;
}


XCode 2.2.1 / gcc 4.0 generates (with the optimizer set to -Os):

mfspr r0,256
stw r0,-8(r1)
oris r0,r0,0x8000
mtspr 256,r0
vcmpgtuh. v0,v3,v2
vcmpgtuh v0,v3,v2
beq cr6,L99
vor v2,v0,v0
lwz r12,-8(r1)
mtspr 256,r12
blr

Note the second "vcmpgtuh" instruction, which is completely superfluous. The peephole optimizer should recognize this situation and remove the instruction. (I've filed a bug with Apple; #4519214.) Anyone know if more recent versions of gcc are able to do this? I have some bottleneck code that could seriously benefit from this, and I'd rather avoid assembly if I can...


Ben, be glad the compiler is sometimes smarter than we are! ;-)

The vcmpgtuh. instruction takes 2 cycles to complete, hence the dependent beq 2 instructions later, but remember that this is a predicate instruction which only updates the CR, and AFAIK does not alter its dummy result register. The 2nd vcmpgtuh also takes 2 cycles to complete, but it returns a result (v0) and does not update the CR. Its result is used by the vor 2 instructions later if the test fails, but that 2nd instruction is the only way to get your results. The CPU remains busy this way (i.e., no stalls), but "that's how it has to be".

What you should be concerned about is saving and restoring the VRSAVE register across such a simple function. It needs to be inlined. Use this prototype declaration:

inline vector unsigned short peepholebug(vector unsigned short a, vector unsigned short b) __attribute__ ((__always_inline__,__nodebug__));

Shaun,

No, the predicate instruction DOES correctly set the result register as well as the condition register (according to Motorola's AltiVec manual), and I've now verified using assembly that the code works the same with the superfluous instruction removed, only faster. (At least on my G5, but I'd be amazed if G4 is different.) Also, the lack of VRSAVE is intentional; the real function uses all the vector registers, and I get a speed win by setting VRSAVE once in the thread entry point by hand. The tiny function I posted was just a test case that isolates the issue.

Ben
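
For readers without a PowerPC machine at hand, the point above can be sketched as a scalar model in portable C. This is not AltiVec code and not what gcc emits; the type vu16 and the functions model_vcmpgtuh_rc and model_peepholebug are hypothetical names invented for this illustration. The model assumes the behavior Ben describes for the record form (vcmpgtuh.): it writes the per-element mask to the destination register and also sets the "all true" and "all false" bits in CR6, so a single compare can feed both the branch and the returned mask.

```c
#include <stdbool.h>
#include <stdint.h>

#define LANES 8  /* a 128-bit vector holds eight unsigned shorts */

/* Hypothetical stand-in for "vector unsigned short". */
typedef struct { uint16_t lane[LANES]; } vu16;

/* Both outputs of the record-form compare, as modeled here. */
typedef struct {
    vu16 mask;       /* destination register: 0xFFFF where x > y, else 0 */
    bool all_true;   /* CR6 bit 0: every lane compared true */
    bool all_false;  /* CR6 bit 2: no lane compared true (tested by beq cr6) */
} cmp_result;

/* Scalar model of "vcmpgtuh. vD,x,y": one pass yields mask and CR6 bits. */
static cmp_result model_vcmpgtuh_rc(vu16 x, vu16 y) {
    cmp_result r;
    r.all_true = true;
    r.all_false = true;
    for (int i = 0; i < LANES; i++) {
        bool gt = x.lane[i] > y.lane[i];
        r.mask.lane[i] = gt ? 0xFFFF : 0x0000;
        if (gt) r.all_false = false;
        else    r.all_true = false;
    }
    return r;
}

/* Model of peepholebug() using a single compare. */
static vu16 model_peepholebug(vu16 a, vu16 b) {
    /* vec_cmplt(a, b) is b > a, i.e. vcmpgtuh with the operands swapped. */
    cmp_result r = model_vcmpgtuh_rc(b, a);
    /* vec_all_ge(a, b) holds exactly when no lane has b > a. */
    if (r.all_false) return a;
    return r.mask;
}
```

In this model the second, non-record compare has nothing left to compute, which is the redundancy the original post reports in the generated assembly.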
PerfOptimization-dev mailing list