On Dec 11, 2009, at 16:30, Jeffrey J. Early wrote:
> 1. ... Is there any hope of having a *single* FFT interface that can
> choose an appropriate implementation based on the machine's
> (heterogenous) hardware?

I am the primary author of the current Accelerate/vecLib/vDSP FFT
implementations, so I will address some of your questions from the
Accelerate perspective.
One can hope for a unified interface, but I am skeptical for several
reasons. For one thing, the implementations are moving targets, so
trying to write one interface that picks the best of the others is not
easy. The OpenCL, Accelerate, and MatrixFFT implementations are written
by separate groups, and we are each trying to improve our offerings and
keep up with new hardware releases. And some information needed to make the
choice simply is not known to the FFT routine. For example, does the
caller want to get the result as soon as possible even if it consumes
more total processor time (by using more processors)? Or should this
call use the most efficient single-processor routine because the other
processors are busy with other work?

> 2. One issue with making a single FFT interface appears to be
> interleaved versus non-interleaved types (is this true? I don't
> actually know much about the FFT implementations themselves, but
> this is certainly true of the interfaces).

Yes, separate code has to be written in the FFT to handle interleaved
and non-interleaved cases. Interleaved complex data is less conducive
to calculation with the SIMD hardware, so performance will generally
lag behind that of separated (non-interleaved) complex data. However,
a special FFT for interleaved data will perform better than
de-interleaving the data, using the separated-data FFT, and
re-interleaving the data. We have not implemented an interleaved FFT
because we have not had much demand for it.
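
To make the two layouts concrete, here is a minimal sketch in plain Python. It is not vDSP code; the function names are made up for illustration. It shows the de-interleave, compute, re-interleave round trip described above, with a simple complex scaling standing in for the FFT. In vDSP itself, the conversions between the two layouts are done by vDSP_ctoz and vDSP_ztoc.

```python
def deinterleave(interleaved):
    """[re0, im0, re1, im1, ...] -> (reals, imags) as two separate lists."""
    if len(interleaved) % 2:
        raise ValueError("interleaved data needs an even number of values")
    return list(interleaved[0::2]), list(interleaved[1::2])


def interleave(reals, imags):
    """(reals, imags) -> [re0, im0, re1, im1, ...]."""
    if len(reals) != len(imags):
        raise ValueError("real and imaginary parts must have equal length")
    out = [0.0] * (2 * len(reals))
    out[0::2] = reals
    out[1::2] = imags
    return out


def scale_split(reals, imags, c_re, c_im):
    """Multiply every element by the complex constant c_re + i*c_im.

    Stand-in for the real work (the FFT). With separated data, each
    output array is built from whole-array operations on the inputs,
    which is the access pattern that suits SIMD hardware.
    """
    out_re = [r * c_re - m * c_im for r, m in zip(reals, imags)]
    out_im = [r * c_im + m * c_re for r, m in zip(reals, imags)]
    return out_re, out_im


# Round trip: interleaved -> separated -> compute -> interleaved.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # (1+2i), (3+4i), (5+6i)
re, im = deinterleave(data)
re, im = scale_split(re, im, 0.0, 1.0)  # multiply by i
result = interleave(re, im)
```

The two conversion steps each touch every element once without doing any arithmetic, which is the overhead a dedicated interleaved FFT would avoid.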

> ... I could go back and convert the rest of my code to interleaved
> types, but is this a waste of time? What is the best approach here?

I cannot say without knowing much more about your application, and,
even then, trying it might be the only way to find out. If you were
working primarily with vDSP, I would generally recommend doing as much
work with separated complex data as possible. However, if you have to
move to interleaved data for MatrixFFT, I cannot say where the
crossover lies.

> 3. More specifically now, it's also noted that out-of-place FFTs are
> slightly slower. But, if I still need a copy of the input, am I
> better off doing a memcpy and then an in-place FFT? Or are the
> implementations such that this is six-of-one, a-half-dozen-of-other?

In the Accelerate FFTs, moving the data from the input to the output
is nearly free. It is done during the regular processing in the first
pass—read from the input, calculate, write to the output. If you do
not need the input data, then in-place processing may be slightly
faster, since it uses less memory and hence less cache and fewer page
table entries. However, if you need a copy of the input data, then
using an Accelerate FFT out-of-place should be faster than using
memcpy and an in-place FFT.
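
As a rough illustration of why the copy is nearly free, here is a small radix-2 FFT sketch in plain Python. This is a textbook decimation-in-time algorithm written for this note, not the Accelerate implementation, and the names are hypothetical. The first pass reads from the input, does the first stage of butterflies, and writes to the output; every later pass works in place on the output, so the input is never modified and no separate copy step is needed.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_out_of_place(src):
    """Forward FFT of src (power-of-two length >= 2); src is left untouched."""
    n = len(src)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two, at least 2")
    bits = n.bit_length() - 1
    dst = [0j] * n

    # First pass: read from src, compute the size-2 butterflies,
    # write to dst. The "copy" happens here as a side effect.
    for m in range(0, n, 2):
        a = complex(src[bit_reverse(m, bits)])
        b = complex(src[bit_reverse(m + 1, bits)])
        dst[m] = a + b
        dst[m + 1] = a - b

    # Remaining passes: in place on dst only.
    size = 4
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                a = dst[start + k]
                b = dst[start + k + half] * w
                dst[start + k] = a + b
                dst[start + k + half] = a - b
                w *= step
        size *= 2
    return dst


signal = [1, 2, 3, 4, 5, 6, 7, 8]
spectrum = fft_out_of_place(signal)  # signal is still intact afterwards
```

The alternative, a copy followed by an in-place transform, makes one extra trip through memory that does no arithmetic at all.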
One caveat if you are running on an older machine: This might not
apply to PowerPC implementations of all the FFT variants.
Fall seven times, stand up eight. — Japanese proverb