Re: [apple scitech] FFT Implementations & Best Practices

On Dec 11, 2009, at 16:30, Jeffrey J. Early wrote:

1. ... Is there any hope of having a *single* FFT interface that can choose an appropriate implementation based on the machine's (heterogenous) hardware?

I am the primary author of the current Accelerate/vecLib/vDSP FFT implementations, so I will address some of your questions from the Accelerate perspective.


One can hope for a unified interface, but I am skeptical for several reasons. For one thing, the implementations are moving targets, so writing one interface that picks the best of the others is not easy. The OpenCL, Accelerate, and MatrixFFT implementations are written by separate groups, and we are each trying to improve our offerings and keep up with new hardware releases. In addition, some of the information needed to make the choice is simply not known to the FFT routine. For example, does the caller want to get the result as soon as possible even if it consumes more total processor time (by using more processors)? Or should this call use the most efficient single-processor routine because the other processors are busy with other work?
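
To make the last point concrete: a unified interface would need the caller to pass in intent that the FFT routine cannot discover by itself. The sketch below is purely hypothetical. The names fft_goal, fft_backend, and choose_backend, the idle_cores parameter, and the 4096-point threshold are all invented for illustration and are not part of Accelerate, OpenCL, or MatrixFFT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: what the caller wants, which the FFT cannot infer. */
typedef enum {
    FFT_GOAL_LOWEST_LATENCY,  /* finish as soon as possible, even at higher total CPU cost */
    FFT_GOAL_LEAST_CPU_TIME   /* spend the fewest total cycles; other cores are busy */
} fft_goal;

/* Hypothetical: the kinds of implementation a dispatcher could pick from. */
typedef enum {
    FFT_BACKEND_SINGLE_CORE,
    FFT_BACKEND_MULTI_CORE
} fft_backend;

/* Pick a backend from the caller's stated goal.
   The 4096-point threshold is an arbitrary placeholder, not a measured crossover. */
fft_backend choose_backend(fft_goal goal, size_t n, unsigned idle_cores)
{
    assert(n > 0);
    if (goal == FFT_GOAL_LOWEST_LATENCY && idle_cores > 1 && n >= 4096)
        return FFT_BACKEND_MULTI_CORE;
    return FFT_BACKEND_SINGLE_CORE;
}
```

Even in this toy form, the dispatcher depends on values (the goal and the number of idle cores) that only the caller or the system scheduler knows, and any real crossover point would shift with each hardware and library release.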

2. One issue with making a single FFT interface appears to be interleaved versus non-interleaved types (is this true? I don't actually know much about the FFT implementations themselves, but this is certainly true of the interfaces).

Yes, separate code has to be written in the FFT to handle the interleaved and non-interleaved cases. Interleaved complex data is less conducive to calculation with the SIMD hardware, so its performance will generally lag behind that of separated (non-interleaved) complex data. However, a special FFT for interleaved data would perform better than de-interleaving the data, using the separated-data FFT, and re-interleaving the result. We have not implemented that because we have not had much demand for it.
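
For readers unfamiliar with the two layouts, here is a minimal portable C sketch of the conversion in each direction. It is not Accelerate code: the split_complex type and the deinterleave and interleave functions are hypothetical names used only for this illustration. (In vDSP, the corresponding conversions are done by vDSP_ctoz and vDSP_ztoc.)

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved layout: re0, im0, re1, im1, ... in a single array.
   Separated (split) layout: all real parts in one array, all imaginary
   parts in another. This struct is a hypothetical stand-in for the idea. */
typedef struct {
    float *re;
    float *im;
} split_complex;

/* Copy n interleaved complex values into separated arrays. */
void deinterleave(const float *inter, split_complex out, size_t n)
{
    assert(inter && out.re && out.im);
    for (size_t i = 0; i < n; ++i) {
        out.re[i] = inter[2 * i];
        out.im[i] = inter[2 * i + 1];
    }
}

/* Copy n separated complex values back into one interleaved array. */
void interleave(split_complex in, float *inter, size_t n)
{
    assert(inter && in.re && in.im);
    for (size_t i = 0; i < n; ++i) {
        inter[2 * i]     = in.re[i];
        inter[2 * i + 1] = in.im[i];
    }
}
```

With separated data, a SIMD register can be loaded with several consecutive real parts (or imaginary parts) directly. With interleaved data, the real and imaginary parts first have to be shuffled apart, which is the extra work the paragraph above refers to.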


... I could go back and convert the rest of my code to interleaved types, but is this a waste of time? What is the best approach here?

I cannot say without knowing much more about your application, and, even then, trying it might be the only way to find out. If you were working primarily with vDSP, I would generally recommend doing as much work with separated complex data as possible. However, if you have to move to interleaved data for MatrixFFT, I cannot say where the crossover lies.


3. More specifically now, it's also noted that out-of-place FFTs are slightly slower. But, if I still need a copy of the input, am I better off doing a memcpy and then an in-place FFT? Or are the implementations such that this is six-of-one, a-half-dozen-of-other?

In the Accelerate FFTs, moving the data from the input to the output is nearly free. It is done during the regular processing in the first pass—read from the input, calculate, write to the output. If you do not need the input data, then in-place processing may be slightly faster, since it uses less memory and hence less cache and fewer page table entries. However, if you need a copy of the input data, then using an Accelerate FFT out-of-place should be faster than using memcpy and an in-place FFT.


One caveat if you are running on an older machine: This might not apply to PowerPC implementations of all the FFT variants.
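
The "nearly free" copy can be illustrated with a textbook radix-2 FFT on separated data. This is only a sketch under stated assumptions, not the Accelerate implementation: fft_out_of_place, bit_reverse, and the caller-supplied twiddle tables tw_re and tw_im are hypothetical names, and n is assumed to be a power of two. The point is that the reordering pass a radix-2 FFT needs anyway reads from the input and writes to the output, so no separate memcpy is required and the input is left untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the low `bits` bits of i. */
static size_t bit_reverse(size_t i, unsigned bits)
{
    size_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}

/* Forward FFT on separated complex data; n must be a power of two.
   tw_re[k], tw_im[k] hold cos(2*pi*k/n) and -sin(2*pi*k/n) for k < n/2,
   precomputed by the caller. Input and output buffers must not overlap. */
void fft_out_of_place(const double *in_re, const double *in_im,
                      double *out_re, double *out_im,
                      const double *tw_re, const double *tw_im, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);

    unsigned bits = 0;
    while (((size_t)1 << bits) < n) ++bits;

    /* First pass: the bit-reversal reordering doubles as the copy. */
    for (size_t i = 0; i < n; ++i) {
        size_t j = bit_reverse(i, bits);
        out_re[j] = in_re[i];
        out_im[j] = in_im[i];
    }

    /* Remaining passes: butterflies operate in place on the output. */
    for (size_t len = 2; len <= n; len <<= 1) {
        size_t step = n / len;  /* stride through the twiddle table */
        for (size_t start = 0; start < n; start += len) {
            for (size_t k = 0; k < len / 2; ++k) {
                double wr = tw_re[k * step], wi = tw_im[k * step];
                size_t a = start + k, b = a + len / 2;
                double tr = out_re[b] * wr - out_im[b] * wi;
                double ti = out_re[b] * wi + out_im[b] * wr;
                out_re[b] = out_re[a] - tr;
                out_im[b] = out_im[a] - ti;
                out_re[a] += tr;
                out_im[a] += ti;
            }
        }
    }
}
```

In vDSP itself, the out-of-place and in-place complex FFTs are vDSP_fft_zop and vDSP_fft_zip respectively. Doing a memcpy followed by an in-place transform would read and write every element one extra time compared with the single combined first pass shown here.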

				— edp (Eric Postpischil)
				http://edp.org

Fall seven times, stand up eight. — Japanese proverb

