Mailing Lists: Apple Mailing Lists


Re: Fast atan2



On Jan 20, 2005, at 4:27 PM, Ian Ollmann wrote:

On Jan 20, 2005, at 3:52 PM, Shaun Wexler wrote:
My vectorized version of atan2 that I wrote for MacFOH performs the full rectangular-to-polar conversion, including UNWRAPPED normalized phase and magnitude in decibels, and profiles 30x faster than (float)atan2(y,x) with floats. The atan2f "cheesy" portion of the vectorized code is 24x faster than libm.

It probably should be mentioned at this point (before the inquest starts) that there are several reasons why libm is the way it is:


1) libm is required to deliver correct results (not sorta correct) including edge cases, with the correct rounding mode, exceptions, etc.
This is expensive because it involves configuring the FPSCR and often quite a bit of branching to deal with edge cases not covered by the general purpose algorithm.
2) libm is required to take a single set of arguments and return a single result.
Read: not enough data -> not enough work to do -> pipeline bubbles
3) libm is sitting behind a dyld stub
...and in certain cases also triggers PIC


So, in summary, it is the way it is because it is required to be so by standards. It is a straw man, set up to be knocked over.

People shouldn't see this as a slam against libm; they should use it as proof that performant applications require inlining/pipelining hot functions, to remove as much overhead as possible.
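To make the trade-off concrete, here is a minimal sketch in portable C (not the MacFOH code, and not AltiVec). It assumes the caller can live with an error of a few milliradians and never passes zeros in both arguments, infinities, or NaNs. The names `approx_atan_unit` and `fast_atan2_batch` are hypothetical. The helper is `static inline`, so there is no stub call per element, and the routine takes whole arrays, so the loop has enough independent work to keep the pipeline busy.

```c
#include <stddef.h>

/* Approximate atan(z) for 0 <= z <= 1 with a well-known low-order fit.
   The worst-case error is on the order of a few milliradians. */
static inline float approx_atan_unit(float z)
{
    const float quarter_pi = 0.78539816f;
    return z * (quarter_pi + 0.273f * (1.0f - z));
}

/* Hypothetical batch routine: out[i] ~= atan2(y[i], x[i]).
   Deliberately skips what libm must handle: x == y == 0, infinities,
   NaNs, rounding modes, and floating-point exception flags. */
void fast_atan2_batch(const float *y, const float *x, float *out, size_t n)
{
    const float pi = 3.14159265f;
    const float half_pi = 1.57079633f;

    for (size_t i = 0; i < n; i++) {
        float ax = x[i] < 0.0f ? -x[i] : x[i];
        float ay = y[i] < 0.0f ? -y[i] : y[i];
        float lo = ax < ay ? ax : ay;   /* smaller magnitude */
        float hi = ax < ay ? ay : ax;   /* larger magnitude  */

        /* The ratio is in [0, 1], so the fit above applies. */
        float r = approx_atan_unit(lo / hi);

        if (ay > ax)     r = half_pi - r;  /* undo the argument swap */
        if (x[i] < 0.0f) r = pi - r;       /* left half-plane        */
        if (y[i] < 0.0f) r = -r;           /* lower half-plane       */
        out[i] = r;
    }
}
```

None of the iterations depend on each other, which is what lets a compiler or a hand-written vector version overlap them. A single-argument, single-result libm call cannot offer that.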


One thing I've grown to love about AltiVec is that you can perform a lot of operations in advance and/or for free. If you Shark a function and it has even a moderate number of stalls (more than a few cycles total), then it is a ripe candidate for performing some free work for you. When I rewrap my unwrapped phase, I have to conditionally generate some additional data points, which require two 4x4 matrix rotations. The first rotation is performed for free by the VPERM unit before the condition is tested, and its stores happen while the second rotation is performed after the condition; if the condition fails, the first rotation's results are simply discarded at no loss. Whenever you see stalls, you can usually sneak more code into a function, and often the gain is more than the sum of the parts. ;)
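The "compute it early, discard it if unneeded" pattern can be sketched in scalar C. This is only an analogy under stated assumptions, not the AltiVec routine described above. It assumes the input phase lies within one turn of the target range (between -3*pi and 3*pi), and the names `select_f32` and `rewrap_phase` are hypothetical. Both candidate corrections are computed before any condition is examined, and a bitwise select (the role `vec_sel` plays on AltiVec) keeps one and throws the other away.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise select: returns if_set where mask is all ones,
   if_clear where mask is all zeros. */
static inline float select_f32(uint32_t mask, float if_set, float if_clear)
{
    uint32_t a, b, r;
    float out;
    memcpy(&a, &if_set, sizeof a);
    memcpy(&b, &if_clear, sizeof b);
    r = (a & mask) | (b & ~mask);
    memcpy(&out, &r, sizeof out);
    return out;
}

/* Hypothetical rewrap: maps a phase in (-3*pi, 3*pi) into [-pi, pi].
   Assumption: the input is never more than one turn out of range. */
void rewrap_phase(const float *in, float *out, size_t n)
{
    const float pi = 3.14159265f;
    const float two_pi = 6.28318531f;

    for (size_t i = 0; i < n; i++) {
        float p = in[i];

        /* Speculative work: both corrections are computed up front. */
        float down = p - two_pi;
        float up   = p + two_pi;

        /* Masks are all ones when the comparison is true, else zero. */
        uint32_t too_high = (uint32_t)0 - (uint32_t)(p > pi);
        uint32_t too_low  = (uint32_t)0 - (uint32_t)(p < -pi);

        /* Keep whichever candidate applies; the rest is discarded. */
        float r = select_f32(too_high, down, p);
        r = select_f32(too_low, up, r);
        out[i] = r;
    }
}
```

On a scalar CPU this trades two extra additions for the absence of branches. Whether that is a win depends on how predictable the branches would have been, so it is worth profiling before adopting it.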
--
Shaun Wexler
MacFOH
http://www.macfoh.com



PerfOptimization-dev mailing list



