Mailing Lists: Apple Mailing Lists


Re: float to int (kinda OT)




Thanks for the vImage plug!

I should note that >>8 is not equivalent to /255. Take alpha=255, color=255 as an example:

        255*255 = 65025

The correct result should be clear: with an alpha of 255, you should get back the same image you put in, since 255 means 100% opacity. So the answer here must be 255.
Now, how do you correct for the "fixed point" multiplication afterward to get that result?


        65025/256       = 254.0039 -> 254    wrong! Whites should stay white.
        (65025+128)/256 = 254.5039 -> 254    wrong! Whites should stay white.
        65025/255       = 255                correct!
        (65025+127)/255 = 255.49   -> 255    also correct, and rounds to nearest.


So, dividing by 255 is actually required to avoid image dimming, especially over multiple compositing or premultiply/unpremultiply operations. 
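
To make the arithmetic above concrete, here is a minimal C sketch of the three ways of scaling a color channel by an 8-bit alpha. The function names are made up for illustration, and none of this is taken from vImage. The third variant is a commonly used shift-and-add substitute for the divide; it is not something the discussion above proposes, and it is only meant for products in the range 0..65025.

```c
#include <stdint.h>

/* Truncating shift: divides by 256 instead of 255, so results come out low
   (255 * 255 gives 254, and white dims). */
uint8_t mul_shift8(uint8_t color, uint8_t alpha)
{
    return (uint8_t)(((uint32_t)color * alpha) >> 8);
}

/* Divide by 255 with round-to-nearest, as in the (x + 127) / 255 line above. */
uint8_t mul_div255(uint8_t color, uint8_t alpha)
{
    return (uint8_t)(((uint32_t)color * alpha + 127) / 255);
}

/* Assumption, not from the discussion above: a divide-free form that gives
   the same rounded result for every 8-bit color/alpha pair. */
uint8_t mul_div255_shifts(uint8_t color, uint8_t alpha)
{
    uint32_t t = (uint32_t)color * alpha + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}
```

With color=255 and alpha=255, mul_shift8 returns 254, while the other two return 255.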

vImage does /255.

On P4/Xeon the hardware int<->float conversions are not that speedy. I'm not sure about P3. The hardware instruction doesn't seem to provide any advantage: on P4, a tight loop using the hardware instruction is no faster than one using the PowerPC algorithm (including the data transfer through memory), provided that the software method is scheduled properly. (I used software pipelining.) The main problem is that in naively compiled code, the software method is almost never scheduled well. I think this is why we suffer from these conversions on PPC. As Holger mentioned, AltiVec addresses this problem nicely.
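
The message does not spell out the software conversion it refers to. As a rough illustration only, one widely described software int-to-double conversion builds the bit pattern of 2^52 plus the sign-flipped integer, then subtracts the constant 2^52 + 2^31. The sketch below assumes IEEE 754 doubles; the function name is hypothetical, and memcpy stands in for the store and reload through memory.

```c
#include <stdint.h>
#include <string.h>

/* Assumed illustration of a software int -> double conversion.
   Requires IEEE 754 binary64 doubles. */
double int_to_double_via_memory(int32_t i)
{
    /* High word 0x43300000 is the exponent for 2^52; the low word holds the
       integer with its sign bit flipped, i.e. i + 2^31 as an unsigned value. */
    uint64_t bits  = 0x4330000000000000ULL | ((uint32_t)i ^ 0x80000000u);
    uint64_t kbits = 0x4330000080000000ULL;   /* 2^52 + 2^31 */
    double biased, k;

    memcpy(&biased, &bits, sizeof biased);    /* integer side -> FP side */
    memcpy(&k, &kbits, sizeof k);

    /* (2^52 + 2^31 + i) - (2^52 + 2^31) = i, computed exactly. */
    return biased - k;
}
```

The point relevant to the scheduling discussion is that the result depends on a store followed by a load of the same data, so several conversions need to be interleaved to hide that latency.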

I'm not sure that unrolling usually buys you that much on x86. The integer registers are mostly special/dedicated purpose, so there isn't much parallelism to work with. The x87 registers are stack based, so you may end up having to use the top of the stack, st(0), for every instruction. In neither case are there many registers to work with. You are lucky if you can unroll 4-way, and I sometimes find that the non-unrolled version of the loop works best. You can do a little more on SSE* with its flat register file, but since it is only a 64-bit wide ALU, the pipelines are effectively only 2 or 3 stages deep, so the win isn't that great. Intel gets quite far with a combination of high clock frequency, low latencies, and, I'm guessing, some aggressive instruction rescheduling and store forwarding. These things work well for simple, non-unrolled code.

Ian

PerfOptimization-dev mailing list      (email@hidden)
References: 
 >Graphics card tricks (From: "Edward K. Chew" <email@hidden>)
 >Re: Graphics card tricks (From: Holger Bettag <email@hidden>)
 >Re: Graphics card tricks (From: Niall Dalton <email@hidden>)
 >float to int (kinda OT) (From: Ando Sonenblick <email@hidden>)
 >Re: float to int (kinda OT) (From: Brendan Younger <email@hidden>)


