What you should be concerned about is the cost of saving and
restoring the VRSAVE register across such a simple function. It
needs to be inlined. Use this prototype declaration:
    inline vector unsigned short peepholebug(vector unsigned short a,
                                             vector unsigned short b)
        __attribute__((__always_inline__, __nodebug__));
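For anyone who wants to try the declaration pattern without an AltiVec toolchain, here is a minimal scalar sketch. The function name pick_greater and its body are hypothetical stand-ins (the real peepholebug body is not shown in this thread), and __nodebug__ is left out because not every GCC-compatible compiler accepts it. The point is only that the attribute goes on the prototype and the definition follows separately.

```c
#include <stdint.h>

/* Hypothetical stand-in for the vector function discussed above.
   The attribute is attached to the prototype, as in the declaration
   quoted in this thread. */
static inline uint16_t pick_greater(uint16_t a, uint16_t b)
    __attribute__((__always_inline__));

/* Definition: returns the larger of the two operands. */
static inline uint16_t pick_greater(uint16_t a, uint16_t b)
{
    return (a > b) ? a : b;
}
```

With GCC or Clang, calls to a function declared this way are expanded in place even at -O0, so no prologue or epilogue is emitted for it at the call site.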
No, the predicate instruction DOES correctly set the result
register as well as the condition register (according to
Motorola's AltiVec manual), and I've now verified using assembly
that the code works the same with the superfluous instruction
removed, only faster. (At least on my G5, but I'd be amazed if G4
is different.) Also, the lack of VRSAVE is intentional; the real
function uses all the vector registers, and I get a speed win by
setting VRSAVE once in the thread entrypoint by hand. The tiny
function I posted was just a test case that isolates the issue.
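As a rough illustration of the claim above, the record form can be modeled in scalar C. This is only a sketch: the names cr6_model and model_vcmpgtuh_rc are made up for this example, and the model covers just the two CR6 summary bits ("all lanes true" and "all lanes false"). It shows one call producing both the per-lane mask and the condition bits, which is why a second non-record compare adds nothing.

```c
#include <stdint.h>

#define LANES 8  /* eight unsigned 16-bit lanes in a 128-bit vector */

/* Hypothetical model of the CR6 summary bits written by "vcmpgtuh." */
typedef struct {
    int all_true;   /* 1 when every lane satisfied a > b          */
    int all_false;  /* 1 when no lane satisfied a > b (beq cr6)   */
} cr6_model;

/* Scalar model of "vcmpgtuh. vD,vA,vB": writes the per-lane mask to d
   AND returns the CR6 summary, mirroring the record form's two results. */
static cr6_model model_vcmpgtuh_rc(uint16_t d[LANES],
                                   const uint16_t a[LANES],
                                   const uint16_t b[LANES])
{
    cr6_model cr = { 1, 1 };
    for (int i = 0; i < LANES; i++) {
        d[i] = (a[i] > b[i]) ? 0xFFFF : 0x0000;  /* all ones or all zeros */
        if (d[i])
            cr.all_false = 0;   /* at least one lane was true  */
        else
            cr.all_true = 0;    /* at least one lane was false */
    }
    return cr;
}
```

In this model the "beq cr6" branch in the test case corresponds to all_false being set, and the mask in d is already what a following plain vcmpgtuh would recompute.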
These instructions push VRSAVE and then mark v0 used:
    vcmpgtuh. v0,v3,v2
    vcmpgtuh  v0,v3,v2
    beq       cr6,L99
    vor       v2,v0,v0
These instructions pop and restore VRSAVE:
    lwz       r12,-8(r1)
    mtspr     256,r12
I would only expect a speedup on a 7400/7410 G4, which executes it in
1 cycle in its VSIU. Regardless of whether the second vcmpgtuh is
redundant, it takes 2 cycles to complete on either a 7450/7455 G4 or
a 970 G5, so the compiler optimizes it that way by default. The vor
is probably executed speculatively anyway, and on the G5 it will be
first in its dispatch group because it follows a branch. YMMV.
If you declared this function always-inline, that would eliminate up
to 8 instructions: 6 for VRSAVE ops, the redundant vcmpgtuh (if
another group of instructions could fill that slot), and the blr.
--
Shaun Wexler
MacFOH http://www.macfoh.com
Arguing with an engineer is like wrestling with a pig in mud.
After a while, you realize the pig is enjoying it.