I should note that >>8 is not equivalent to /255. Take alpha=255,
color=255 as an example:
255*255 = 65025
The correct result should be clear. With an alpha of 255, you should get
back the same image you put in, since 255 means 100% opacity.
Now, how do you scale the "fixed point" product back down afterward
to get that result?
65025/256 = 254.0039 -> 254          wrong! Whites should stay white.
(65025+128)/256 = 254.5039 -> 254    wrong! Whites should stay white.
65025/255 = 255                      correct!
(65025+127)/255 = 255.49 -> 255      Also correct, and rounds to nearest.
So, dividing by 255 is actually required to avoid image dimming,
especially when the error accumulates over repeated compositing or
premultiply/unpremultiply operations.
vImage does /255.
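
The arithmetic above can be sketched in C as follows. This is only an illustration of the four formulas; the function names are made up for the example and say nothing about how vImage implements its divide. The last variant is a commonly used add-and-shift formulation that gives the same answer as the rounded /255 for every 8-bit alpha/color pair, included here as an assumption about what one might use when a real divide is too slow.

```c
#include <stdint.h>

/* Truncating shift: cheap, but maps 255*255 to 254. */
static uint8_t mul_shift8(uint8_t a, uint8_t c) {
    return (uint8_t)(((uint32_t)a * c) >> 8);
}

/* Shift with a rounding bias: still maps 255*255 to 254. */
static uint8_t mul_shift8_round(uint8_t a, uint8_t c) {
    return (uint8_t)(((uint32_t)a * c + 128) >> 8);
}

/* True divide by 255, rounded to nearest. */
static uint8_t mul_div255_round(uint8_t a, uint8_t c) {
    return (uint8_t)(((uint32_t)a * c + 127) / 255);
}

/* Same result as the rounded /255, using only adds and shifts. */
static uint8_t mul_div255_shifts(uint8_t a, uint8_t c) {
    uint32_t t = (uint32_t)a * c + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Multiply a white sample by a fully opaque alpha `passes` times,
   to show how the error accumulates (hypothetical helper). */
static uint8_t repeat_opaque(uint8_t (*mul)(uint8_t, uint8_t), int passes) {
    uint8_t c = 255;
    for (int i = 0; i < passes; i++) {
        c = mul(255, c);
    }
    return c;
}
```

For example, repeat_opaque(mul_shift8, 10) returns 245, so white has lost ten levels after ten fully opaque passes, while repeat_opaque(mul_div255_round, 10) still returns 255.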

On P4/Xeon the hardware int<->float conversions are not that speedy.
I'm not sure about P3. The hardware instruction doesn't seem to provide
any advantage: on P4, a tight loop using the hardware instruction runs
no faster than one that uses the PowerPC algorithm (including the data
transfer through memory), provided that the software method is
scheduled properly. (I used software pipelining.)
The main problem is that in naively compiled code, int<->float
conversions done with the software method are almost never scheduled
well. I think this is why we suffer on PPC from these things. As Holger
mentioned, AltiVec addresses this problem nicely.
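
For readers who have not seen the software method, here is a minimal sketch of what "the PowerPC algorithm" most likely refers to: the usual magic-constant trick, where the integer is stored into the low half of a double whose high half holds the exponent of 2^52, and the bias is then subtracted. That reading is an assumption, the function name is hypothetical, and the code assumes IEEE-754 doubles.

```c
#include <stdint.h>
#include <string.h>

/* Convert a signed 32-bit integer to double without using an
   int->float conversion instruction. Assumes IEEE-754 binary64. */
static double int32_to_double_magic(int32_t i) {
    /* Flip the sign bit so the value becomes the unsigned number
       i + 2^31. */
    uint32_t biased = (uint32_t)i ^ 0x80000000u;

    /* 0x43300000 in the high word is the exponent of 2^52; the biased
       integer sits in the low mantissa bits, so the bit pattern
       encodes exactly 2^52 + (i + 2^31). */
    uint64_t bits = 0x4330000000000000ULL | (uint64_t)biased;

    double d;
    memcpy(&d, &bits, sizeof d);   /* the transfer-through-memory step */

    /* Subtract the bias 2^52 + 2^31 to recover i exactly. */
    return d - 4503601774854144.0;
}
```

The store, reload, and subtract form a dependent chain, which is why the result depends so much on how well several conversions are interleaved in the loop.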

I'm not sure that unrolling usually buys you that much on x86. The
integer registers are mostly special/dedicated purpose, so there isn't
much parallelism to work with. The x87 registers are stack based, so
you may end up having to go through st(0) for every instruction. In
neither case are there many registers to work with. You are lucky if
you can unroll 4-way, and I sometimes find that the non-unrolled
version of the loop works best. You can do a little more on SSE* with
the flat register file, but since the ALU is only 64 bits wide, the
pipelines are effectively only 2 or 3 stages deep, so the win isn't
that great. Intel gets quite far with a combination of high clock
frequency, low latencies, and, I'm guessing, some aggressive
instruction rescheduling and store forwarding. These things work well
for simple, non-unrolled code.