So why does such a modern OS still not have processor affinity?
It should be an easy and obvious win for the kernel team to
implement it.
That is an excellent question to put in a Feature Request bug
report directed towards the Kernel/Features component.
Searching for "strip mining" suggests that it just means working on a
cache-sized amount of data at a time. My code already works with
L2-sized chunks, so I guess I'll have to second the processor affinity
request. A bit of searching reveals that Linux seems to have the sort
of thing I'm looking for in sched_setaffinity / sched_getaffinity:
http://www.die.net/doc/linux/man/man2/sched_setaffinity.2.html
It looks as though the interface takes a bitmask with one bit per
processor; the bits that are set select the processors the process is
allowed to run on.
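For reference, here is a minimal sketch of how that Linux interface is typically used. It is Linux/glibc only and does not build on Darwin; the helper names single_cpu_mask, first_allowed_cpu and pin_to_cpu are my own, not part of any API.

```c
#define _GNU_SOURCE  /* must precede the includes: exposes the CPU_* macros */
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Build a mask with only the bit for `cpu` set. Returns 0, or -1 if
   `cpu` does not fit in a cpu_set_t. (Hypothetical helper.) */
int single_cpu_mask(int cpu, cpu_set_t *mask)
{
    assert(mask != NULL);
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(mask);
    CPU_SET(cpu, mask);
    return 0;
}

/* Lowest-numbered CPU the calling thread may currently run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    return -1;
}

/* Restrict the calling thread to `cpu`. Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
    cpu_set_t mask;
    if (single_cpu_mask(cpu, &mask) != 0)
        return -1;
    return sched_setaffinity(0, sizeof(mask), &mask);  /* pid 0 = caller */
}
```

A worker thread would call something like pin_to_cpu(first_allowed_cpu()) once at start-up, before touching its chunk of data.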
The only BSD / Darwin references I could find were to a kernel
scheduler called ULE but I couldn't work out if this has been
implemented, is still a work in progress or dead.
I've also found a few references to "utilBindThreadToCPU" from the
CHUD framework but this seems to be purely an experimental interface
for testing purposes. I take it this is not The Right Way.
On a related note, what is a good size for the amount of data to work
on with Apple's vDSP Fourier transform routines? Currently, my code
processes data in chunks of groupSize elements:
    spectraPerGroup = l2CacheSize / (4 * spectraLen * sizeof(float));
    groupSize = spectraLen * spectraPerGroup;
    => groupSize = l2CacheSize / (4 * sizeof(float))
and then does Fourier transforms with:
    if (canOverwriteData) {
        if (spectraPerGroup == 1)
            vDSP_fft_zip(fftSetup, &input, 1, fftLog2, FFT_INVERSE);
        else
            vDSP_fftm_zip(fftSetup, &input, 1, spectraLen, fftLog2,
                          spectraPerGroup, FFT_INVERSE);
    } else {
        if (spectraPerGroup == 1)
            vDSP_fft_zop(fftSetup, &input, 1, &freqData, 1, fftLog2,
                         FFT_INVERSE);
        else
            vDSP_fftm_zop(fftSetup, &input, 1, spectraLen, &freqData, 1,
                          spectraLen, fftLog2, spectraPerGroup, FFT_INVERSE);
    }
Should I be using a different value for groupSize with the "zip"
versus the "zop" routines? Presumably "zip" uses less memory, since it
works in place.
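To make the question concrete, here is the chunk arithmetic above as a small sketch that also reports the memory touched per call. It assumes each element is a single-precision complex value (two floats, 8 bytes), which is what makes a 131,072-point spectrum 1 MiB. The ChunkPlan struct, planChunks, and the clamp to one spectrum are my additions for illustration, not part of the code above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t spectraPerGroup;  /* whole spectra per chunk */
    size_t groupSize;        /* complex elements per chunk */
    size_t inPlaceBytes;     /* footprint of a "zip" call: one buffer */
    size_t outOfPlaceBytes;  /* footprint of a "zop" call: input + output */
} ChunkPlan;

/* Hypothetical helper reproducing the sizing formula from the post. */
ChunkPlan planChunks(size_t l2CacheSize, size_t spectraLen)
{
    assert(spectraLen > 0);
    ChunkPlan p;

    /* Integer division: the result is 0 when a single spectrum is larger
       than a quarter of the cache (in these units). */
    p.spectraPerGroup = l2CacheSize / (4 * spectraLen * sizeof(float));
    if (p.spectraPerGroup == 0)
        p.spectraPerGroup = 1;  /* assumption: always process at least one */

    p.groupSize = spectraLen * p.spectraPerGroup;

    /* Assumption: one element = one real float + one imaginary float. */
    p.inPlaceBytes = p.groupSize * 2 * sizeof(float);
    p.outOfPlaceBytes = 2 * p.inPlaceBytes;
    return p;
}
```

With a 4 MiB L2 and spectraLen = 131072 this gives spectraPerGroup = 2, so under these assumptions the in-place path touches 2 MiB per call and the out-of-place path touches 4 MiB, i.e. the whole cache.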
Additionally, the most common FFT length is 131,072 (i.e. 1 MiB of
data). Is there a cunning way to divide this between the caches of
multiple processors, e.g. by doing the twiddle factor multiplication
myself?
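To illustrate the kind of split I have in mind: the standard Cooley-Tukey factorization N = n1 * n2 (sometimes called a "four-step" FFT) turns one long transform into n2 short column transforms, a pointwise twiddle multiplication, and n1 short row transforms. The sketch below is only meant to show the indexing. It uses a naive DFT for the short transforms where real code would presumably call vDSP, it takes a caller-supplied root table wr/wi holding exp(-2*pi*i*k/N) (conjugate the table for the inverse direction), and all function names are hypothetical. Whether this actually beats a single large vDSP call is something I have not measured.

```c
#include <assert.h>
#include <stddef.h>

/* Naive DFT of `len` strided points. wr/wi hold the n-th roots of unity,
   exp(-2*pi*i*k/n) for k = 0..n-1, and len must divide n. Input and
   output must not overlap. */
static void strided_dft(const float *in_re, const float *in_im, size_t in_stride,
                        float *out_re, float *out_im, size_t out_stride,
                        size_t len, size_t n, const float *wr, const float *wi)
{
    size_t step = n / len;  /* every step-th n-th root is a len-th root */
    for (size_t k = 0; k < len; k++) {
        float sr = 0.0f, si = 0.0f;
        for (size_t j = 0; j < len; j++) {
            size_t t = ((j * k) % len) * step;
            float xr = in_re[j * in_stride], xi = in_im[j * in_stride];
            sr += xr * wr[t] - xi * wi[t];
            si += xr * wi[t] + xi * wr[t];
        }
        out_re[k * out_stride] = sr;
        out_im[k * out_stride] = si;
    }
}

/* Length n1*n2 DFT of x into out, using tmp (n1*n2 elements) as scratch.
   Input index is n2*j1 + j2; output index is k1 + n1*k2. */
void four_step_dft(const float *x_re, const float *x_im,
                   float *tmp_re, float *tmp_im,
                   float *out_re, float *out_im,
                   size_t n1, size_t n2, const float *wr, const float *wi)
{
    assert(n1 > 0 && n2 > 0);
    size_t n = n1 * n2;

    /* Step 1: n2 independent length-n1 transforms down the columns. */
    for (size_t c = 0; c < n2; c++)
        strided_dft(x_re + c, x_im + c, n2, tmp_re + c, tmp_im + c, n2,
                    n1, n, wr, wi);

    /* Step 2: twiddle factors, tmp[k1][c] *= W_n^(c * k1). */
    for (size_t k1 = 0; k1 < n1; k1++)
        for (size_t c = 0; c < n2; c++) {
            size_t i = k1 * n2 + c, t = (c * k1) % n;
            float re = tmp_re[i] * wr[t] - tmp_im[i] * wi[t];
            float im = tmp_re[i] * wi[t] + tmp_im[i] * wr[t];
            tmp_re[i] = re;
            tmp_im[i] = im;
        }

    /* Step 3: n1 independent length-n2 transforms along the rows. */
    for (size_t k1 = 0; k1 < n1; k1++)
        strided_dft(tmp_re + k1 * n2, tmp_im + k1 * n2, 1,
                    out_re + k1, out_im + k1, n1, n2, n, wr, wi);
}
```

The loops in steps 1 and 3 have no dependencies between iterations, so each processor could take a share of the columns and then a share of the rows, with each short transform small enough to sit in its own cache.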