A naive compiler will branch when generating code for (remainder255 >=
(255 + 128)); that's a comparison, and comparisons tend to generate branches.
A smarter compiler might have a clever way of avoiding the branch,
perhaps involving a subtract followed by a cntlzw (?). That's just off
the top of my head; I haven't tested it.
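To make the "subtract and look at the bits" idea concrete, here is a minimal, untested-on-PowerPC sketch in C. It uses the sign bit of the difference rather than cntlzw, and the function name ge_branchless is just made up for illustration. It assumes both inputs are below 2^31, which is easily true for the values in this thread.

```c
#include <stdint.h>

/* Returns 1 if x >= threshold, else 0, without writing a comparison.
   Assumption: x and threshold are both below 2^31, so when x < threshold
   the unsigned subtraction wraps around and sets bit 31 of the result. */
static uint32_t ge_branchless(uint32_t x, uint32_t threshold)
{
    uint32_t diff = x - threshold;   /* wraps (bit 31 set) only if x < threshold */
    return (diff >> 31) ^ 1u;        /* bit 31 is 1 when x < threshold, so invert it */
}
```

For the case above you would call ge_branchless(remainder255, 255 + 128). Whether this actually beats what the compiler emits for the plain comparison is something you would have to measure.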
If "alpha" is a constant across the whole image operation, you could
multiply it by a cleverly-chosen constant so that "(alpha *
multiplier) >> something" gives you numbers scaled to any range you
prefer. This is a trick I've used in the past with good results. (The
constant tends to be something wacky; I remember getting 0x8102 or
0x10203 for various operations.) There is one caveat:
some PowerPCs can multiply small numbers (e.g. 8-bit values) a little
faster than large ones. I have no idea if this is ancient history or
if G4s and G5s still have this restriction. Still, I'd rather pay an
extra two cycles on the multiply than add extra instructions to the
inner loop.
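As a rough illustration of the multiply-and-shift trick, here is a sketch that rescales an 8-bit alpha from 0..255 onto 0..256 with rounding. The constants (0x8081, the 0x4000 rounding term, the shift of 15) and the function name are my own picks for this example, not the 0x8102 or 0x10203 mentioned above, so treat them as assumptions.

```c
#include <stdint.h>

/* Hypothetical constants chosen for this sketch only. */
#define SCALE_MULTIPLIER 0x8081u   /* roughly (256/255) * 2^15 */
#define SCALE_ROUND      0x4000u   /* half of 2^15, for round-to-nearest */
#define SCALE_SHIFT      15

/* Map alpha in 0..255 onto 0..256, rounding to nearest.
   Assumption: alpha <= 255, so the product fits easily in 32 bits. */
static uint32_t scale_alpha_to_256(uint32_t alpha)
{
    return (alpha * SCALE_MULTIPLIER + SCALE_ROUND) >> SCALE_SHIFT;
}
```

The point is that this runs once per image, outside the inner loop. With alpha on a 0..256 scale, the per-pixel work can use a shift by 8 instead of a divide by 255, though whether the resulting rounding is acceptable depends on the operation.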
Fast ways of doing this are not a tightly held secret. The beautiful
thing about fixed point operations is that approximations can turn out
to be completely correct after rounding. No reason to spend a lot of
time doing a divide when a 1st order polynomial will do! Multiply by a
number sufficiently close to 1/255 to give the right result. You can
get a free right shift if you use mulhw(u). GCC knows this one. So, for
example, for the inner loop of:
#include <stdio.h>

int main( void )
{
    int alpha, red;

    for( alpha = 0; alpha < 256; alpha++ )
    {
        for( red = 0; red < 256; red++ )
        {
            /* rounded 8-bit multiply: alpha * red / 255 */
            int correct = (alpha*red+127)/255;
            printf( "%d\n", correct );
        }
    }
    return 0;
}
...GCC does the following:
So unfortunately, this is one of those cases where if you had just
written
    (alpha * red + 127) / 255
it might have been faster.
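For anyone who wants to see the reciprocal multiply spelled out by hand, here is a small sketch of the idea in C. This is not GCC's actual output. The constant 0x8081 with a shift of 23 is one choice that happens to be exact over the range needed here (inputs up to 255*255 + 127), and the function name is made up.

```c
#include <stdint.h>

/* Computes (alpha * red + 127) / 255 with a multiply and a shift.
   Assumption: alpha and red are both in 0..255. Over that range the
   numerator is at most 65152, where multiplying by 0x8081 and shifting
   right by 23 gives the same result as dividing by 255. */
static uint32_t mul_div255(uint32_t alpha, uint32_t red)
{
    uint32_t x = alpha * red + 127u;   /* at most 255*255 + 127 = 65152 */
    return (x * 0x8081u) >> 23;        /* 0x8081 / 2^23 is slightly above 1/255 */
}
```

The product stays below 2^32, so plain 32-bit unsigned arithmetic is enough. Outside the stated input range the approximation eventually goes wrong, so don't reuse it blindly for wider values.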
As it turns out, in the particular case of /255, the result of the
multiplication can sometimes be generated by a permute instruction
instead of a multiplication, so you might not even need to do a
multiply.
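I can't vouch for the exact vector sequence meant here, but the scalar cousin of the idea is the well-known shift-and-add form below: the reciprocal of 255 has a repeating byte pattern, so "multiplying" by it amounts to adding a byte-shifted copy of the value. This is a different technique from a permute, shown only as a sketch, and the function name is hypothetical.

```c
#include <stdint.h>

/* Rounded division by 255 with no multiply and no divide.
   Assumption: x is in 0..65025 (the product of two 8-bit values). */
static uint32_t div255_round(uint32_t x)
{
    uint32_t t = x + 128u;        /* rounding bias */
    return (t + (t >> 8)) >> 8;   /* t * (1 + 1/256) / 256 approximates t / 255 */
}
```

Over that input range it agrees with (x + 127) / 255, so it could stand in for the expression in the test program above.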