Then I can Shark it and see whether it is indeed better or not.
My question is this: my code also runs on Windows x86. I have no tool
like Shark there, and no real knowledge of the actual workings of the
x86 chips, etc.
So if the int math version is faster on the Mac, should I keep the
float version on the PC? My presumption is yes, but you guys on this
list seem pretty dang knowledgeable about some pretty intense stuff
with regard to chips, and I figure someone surely knows about the PC
chips... thought I'd ask.
Since all you're doing is pre-multiplying alpha, you can use the vImage
portion of the Accelerate framework. Also, since there's a pretty good
chance you're doing other image processing operations, you'll probably
want to use vImage for the rest of those as well. In general, float ->
int and int -> float conversions are ridiculously inefficient on the
PowerPC, since the only way to do them is to store the data to memory
and load it back into the appropriate register file. On x86, I believe
there are instructions that move the data directly between the
different register files. That said, the code you're writing will
probably not run very quickly on any architecture. Below is a revised
version.
I assume that "components" is something like "struct components {
uint8_t alpha; struct colors colors; }" with "struct colors { uint8_t
red; uint8_t green; uint8_t blue; }". If so, you've basically got
8-bit-per-channel ARGB pixels and can write something like this:
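uint8_t* destination_ptr = ?;
size_t i;
for(i = 0; i < number_of_pixels; i ++) {
    uint16_t a, r, g, b;
    /* load one ARGB pixel */
    a = destination_ptr[0];
    r = destination_ptr[1];
    g = destination_ptr[2];
    b = destination_ptr[3];
    /* scale each color by alpha; >> 8 divides by 256 */
    r = (a * r) >> 8;
    g = (a * g) >> 8;
    b = (a * b) >> 8;
    /* store the results and advance to the next pixel */
    destination_ptr[1] = (uint8_t)r;
    destination_ptr[2] = (uint8_t)g;
    destination_ptr[3] = (uint8_t)b;
    destination_ptr += 4;
}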
Note that the right shift by 8 divides by 256, which only approximates
dividing by 255. The result is never more than one below the truncated
exact quotient (a = r = 255 gives 254 rather than 255), so it shouldn't
make that much difference. Note also that there is a *lot* more
optimization that you can apply to this function, which is why you
should use the vImage version to save yourself time. However, if you
want your x86 code to run faster too, you'll need to start thinking
about unrolling the loop, and probably do some tricks to avoid having
so many 1-byte loads and stores, etc. You could also probably write
vector code using the SSE2 instructions.
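
To make two of the points above concrete, here is a rough sketch in
plain C. It is not the vImage implementation and has not been tuned or
profiled. The names div255_round and premultiply_argb8888_packed are
made up for illustration, and the pixel layout (bytes in A, R, G, B
order) is the same assumption as in the loop above. The first function
shows a well-known shift-and-add trick that gives the exactly rounded
x / 255 if the ">> 8" approximation turns out to matter. The second
gathers each pixel into one 32-bit word and scales red and blue with a
single multiply, which is one way to cut down on the multiplies and
the 1-byte traffic. Whether a given compiler actually merges the byte
accesses into one load and one store is something you would have to
check.

```c
#include <stddef.h>
#include <stdint.h>

/* Exactly rounded x / 255 for 0 <= x <= 255 * 255, using only adds and
 * shifts. Hypothetical helper name. */
static inline uint8_t div255_round(uint32_t x)
{
    uint32_t t = x + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Premultiply 8-bit ARGB pixels in place, keeping the ">> 8"
 * approximation from the loop above. Hypothetical function name;
 * assumes bytes are stored in A, R, G, B order. */
static void premultiply_argb8888_packed(uint8_t *p, size_t number_of_pixels)
{
    size_t i;
    for (i = 0; i < number_of_pixels; i++, p += 4) {
        /* Gather the pixel into one word: A in bits 0-7, R in 8-15,
         * G in 16-23, B in 24-31. Written with shifts so the result
         * does not depend on the host byte order. */
        uint32_t px = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
                      (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;

        uint32_t a  = px & 0xFF;
        uint32_t rb = (px >> 8) & 0x00FF00FF;  /* R in bits 0-7, B in 16-23 */
        uint32_t g  = (px >> 16) & 0xFF;

        /* One multiply scales R and B together. Each product is at most
         * 255 * 255 = 65025, so it fits in its 16-bit lane. */
        rb = ((rb * a) >> 8) & 0x00FF00FF;
        g  = (g * a) >> 8;

        /* Reassemble in the same layout and write the pixel back. */
        px = a | (rb << 8) | (g << 16);
        p[0] = (uint8_t)px;
        p[1] = (uint8_t)(px >> 8);
        p[2] = (uint8_t)(px >> 16);
        p[3] = (uint8_t)(px >> 24);
    }
}
```

To get exact results instead of the approximation, you would compute
each channel separately as div255_round(a * r) and so on, at the cost
of giving up the combined red/blue multiply as written here.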