I have a few general and specific Mac FFT questions I wanted to throw out there. Let me start with some quick background and some broader questions first.
In the Sept 2009 post from Richard Crandall here,
there's a very nice implementation of a very fast multithreaded FFT algorithm. I think that the MatrixFFT library is absolutely fantastic and thank R. Crandall, et al. very much! Using the included tests for the scenarios I'm looking at (2D FFTs with n >= 2^19), the MatrixFFT library provides a 10-fold speedup over the Accelerate implementation. I just spent a few hours today modifying my code to use his MatrixFFT library.
In addition, I've also started playing with the OpenCL FFT implementation that was posted here
a month ago or so. The OpenCL implementation fails completely on my hardware (Radeon 4870) by throwing enormous L2 errors, producing incorrect answers at fairly slow (moderate?) speeds.
1. Assuming the OpenCL FFT implementation were to actually work, this is now three different FFT implementations on the Mac platform coming out of Apple (Accelerate, MatrixFFT, OpenCL_FFT), each with performance advantages in different situations. Is there any hope of having a *single* FFT interface that can choose an appropriate implementation based on the machine's (heterogeneous) hardware?
2. One issue with making a single FFT interface appears to be interleaved versus non-interleaved types (is this true? I don't actually know much about the FFT implementations themselves, but this is certainly true of the interfaces). When I moved my existing code from the Accelerate FFT to the MatrixFFT implementation, I got about a 2-fold speedup (rather than the 10-fold I saw in the 'pure' test, despite FFTs being my speed limiter). Some basic performance testing suggests that the limiting issue is that I'm now forced to convert between the non-interleaved DSPSplitComplex type and the interleaved FFTComplex type (fftVDSPToInt and fftIntToVDSP). I could go back and convert the rest of my code to interleaved types, but is this a waste of time? What is the best approach here?
3. More specifically now, it's also noted that out-of-place FFTs are slightly slower. But if I still need a copy of the input, am I better off doing a memcpy and then an in-place FFT? Or are the implementations such that this is six of one, half a dozen of the other?
4. Any ideas why the Radeon 4870 fails? I'd like to eventually get the full performance of my graphics card for the FFTs, if possible.
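To make the layout issue in question 2 concrete, here is a minimal sketch in plain C of the kind of conversion I mean. The types SplitBuf and InterleavedElt and both function names are hypothetical stand-ins that I made up for illustration; they are not the real DSPSplitComplex or FFTComplex definitions, and this is not how fftVDSPToInt or fftIntToVDSP are actually implemented. (Accelerate's own vDSP_ctoz and vDSP_ztoc do this kind of layout conversion as well.)

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical split (non-interleaved) layout: two separate arrays. */
typedef struct {
    float *realp;
    float *imagp;
} SplitBuf;

/* Hypothetical interleaved layout: real and imaginary parts adjacent. */
typedef struct {
    float re;
    float im;
} InterleavedElt;

/* Copy n complex values from the split layout into the interleaved layout. */
void split_to_interleaved(const SplitBuf *src, InterleavedElt *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i].re = src->realp[i];
        dst[i].im = src->imagp[i];
    }
}

/* Copy n complex values from the interleaved layout back into the split layout. */
void interleaved_to_split(const InterleavedElt *src, SplitBuf *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst->realp[i] = src[i].re;
        dst->imagp[i] = src[i].im;
    }
}
```

Each conversion is a full pass over the data, so one before and one after every FFT adds two extra sweeps through memory per transform. On the other hand, the conversion into a separate interleaved buffer already leaves the original split data untouched, so in that case the copy asked about in question 3 presumably comes for free and the FFT could run in place on the converted buffer.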
Jeffrey J. Early