This might not be the right list, but I figure that memcpy technically
counts as performance because it uses SSE (and hell, which Darwin
list do I pick? x86? kernel? user?). Ordinarily I would submit a bug
report, but I haven't been able to reproduce the problem in code
outside the application.
Summary: I have some code that does a ton of FFT / DSP processing
which started to produce wildly inaccurate results when ported to the
Core Duo (the exact same code runs perfectly on PowerPC G4 / G5). I
eventually narrowed the problem down to what appears to be memcpy.
However, if I were to place bets on me being wrong or Apple's memcpy
being bogus, I'd know where I'd put the money.
Essentially, I have now reduced the code to something which looks like
this (which is still contained within the main part of the application):
unsigned loopCount = 0;
while (1) {
    ++loopCount;
When run on an Intel Core Duo (MacBook Pro), it produces the
following output (and output continues to be produced ad infinitum):
-- finished after loop 181 --
no. differences 1 = 0
no. differences 2 = 16
no. bytes copied = 524288
offset = 917504 (114688)
groupSize = 131072
spectraPerGroup = 16384
numSpectra = 131072
spectraLen = 8
r1 = 0x3388040
i1 = 0x3788040
r2 = 0x3b89040
i2 = 0x3f89040
-- finished after loop 225 --
no. differences 1 = 8
no. differences 2 = 0
no. bytes copied = 524288
offset = 524288 (65536)
groupSize = 131072
spectraPerGroup = 16384
numSpectra = 131072
spectraLen = 8
r1 = 0x3208040
i1 = 0x3608040
r2 = 0x3a09040
i2 = 0x3e09040
No output is produced when run on a PowerPC G4 (1 GHz PowerBook),
even after running for a length of time approaching infinity.
Also, note that the data in the DSPSplitComplex structure (from the
Accelerate framework) is allocated something like:
data.realp = [ 64-byte aligned (after standard malloc), length 2097152 (float) ]
data.imagp = data.realp + 1048576
chirped.realp = [ 64-byte aligned (after standard malloc), length 2097152 (float) ]
chirped.imagp = chirped.realp + 1048576
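For anyone who wants to mirror that layout, here is a rough sketch of how such a buffer could be set up. It is not the code from my application: SplitComplex is a hypothetical stand-in for DSPSplitComplex (so it builds without Accelerate), and allocSplit, ALIGNMENT and the rawOut parameter are names made up for illustration. It over-allocates with plain malloc, rounds the pointer up to a 64-byte boundary, and places the imaginary half directly after the real half.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for DSPSplitComplex (same two fields). */
typedef struct {
    float *realp;
    float *imagp;
} SplitComplex;

enum { ALIGNMENT = 64 };  /* bytes; must be a power of two */

/* Allocate `total` floats on a 64-byte boundary. The first half is used
   for the real parts and the second half for the imaginary parts.
   *rawOut receives the original malloc pointer, which is the one that
   must later be passed to free(). Returns 0 on success, -1 on failure. */
int allocSplit(SplitComplex *out, void **rawOut, size_t total)
{
    assert(total % 2 == 0);

    /* Over-allocate so there is room to round the pointer up. */
    void *raw = malloc(total * sizeof(float) + ALIGNMENT - 1);
    if (raw == NULL)
        return -1;

    uintptr_t aligned = ((uintptr_t)raw + ALIGNMENT - 1)
                        & ~(uintptr_t)(ALIGNMENT - 1);

    out->realp = (float *)aligned;
    out->imagp = out->realp + total / 2;
    *rawOut = raw;
    return 0;
}
```

Calling allocSplit(&data, &raw, 2097152) would give data.imagp == data.realp + 1048576, matching the description above.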
Even though I don't have separate code which reproduces the problem,
I do have a copy of the code which should click 'n' build via Xcode
(and run from the command line). It is open source, so if someone
else wants to have a look, that's fine.
The only explanation I can think of is that something goes wrong
during the memcpy (possibly due to the dual processors and the use of
the MOVNTDQ instruction...?). The fact that the differences come in
bursts of 16 bytes (multiples of 4 floats) seems to point this way.
Any additional debugging advice would be extremely helpful.
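In case it helps anyone trying to isolate this, below is a minimal sketch of the kind of check a standalone test could run. It is an assumption about how one might go about it, not the code from my application: countDifferingChunks, copyAndCheck and CHUNK are hypothetical names. It copies a buffer with memcpy and then counts how many 16-byte chunks of the destination differ from the source, since 16 bytes is the granularity the corruption seems to have.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 16 };  /* bytes, i.e. 4 floats, the suspected burst size */

/* Count the 16-byte chunks in which two n-byte buffers differ.
   A trailing partial chunk, if any, is compared and counted as well. */
size_t countDifferingChunks(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    size_t bad = 0;

    for (size_t off = 0; off < n; off += CHUNK) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        if (memcmp(pa + off, pb + off, len) != 0)
            ++bad;
    }
    return bad;
}

/* Copy `count` floats with memcpy, then report how many 16-byte
   chunks of the destination do not match the source. */
size_t copyAndCheck(float *dst, const float *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, count * sizeof(float));
    return countDifferingChunks(dst, src, count * sizeof(float));
}
```

Running copyAndCheck in a tight loop over 524288-byte blocks, ideally from two threads at once, should report zero every time if memcpy is behaving. Note that a clean run would not prove much, given that I have only seen the problem inside the full application.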