Re:[OT] PPC Float to Int
- Subject: Re:[OT] PPC Float to Int
- From: Ian Ollmann <email@hidden>
- Date: Wed, 13 Aug 2003 13:49:11 -0700
On Wed, 13 Aug 2003 Art Gillespie wrote:
> So I'm porting some Windows DSP code from another developer and there
> are quite a few Float->Int conversions per sample.
>
> Is there anything faster on PPC than fctiw/fctiwz and the subsequent
> stfd? The fctiw is fast enough, but writing the result out to memory
> and then loading it into a integer register are making things, um,
> not-cpu-friendly.
There is no direct data path between integer, floating point and vector
registers.* If you need to transfer data back and forth, you have to store
the data out, wait about 6 cycles and then load it back in. There is no
escaping that. On the G5, the delay is potentially many times greater.
This problem is not going away soon. As a result, the take-home message
is: don't transfer data back and forth between the floating point register
file and the integer register file if you can in any way avoid it.
There are a couple of ways around this problem for float to int
conversions. The first is to "just say no." Either don't convert or do the
conversion as the last step in your function so that you don't have to
load the integer data back in.
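As a rough illustration of "convert last", here is a minimal sketch in plain C. The function name and the gain/clamp parameters are made up for this example. The idea is only that all the arithmetic happens while the sample is still a float, and the conversion is the final step before the store, so nothing has to come back from memory into an integer register inside the loop.

```c
#include <stddef.h>

/* Hypothetical example: scale and clamp in floating point, then convert
 * once at the very end. Names and parameters are illustrative only. */
void scale_clamp_convert(const float *in, int *out, size_t count,
                         float gain, float lo, float hi)
{
    size_t i;
    for (i = 0; i < count; i++) {
        float x = in[i] * gain;   /* math stays in the FP register file */
        if (x < lo) x = lo;       /* clamp as a float, not as an int */
        if (x > hi) x = hi;
        out[i] = (int)x;          /* one conversion, stored, not reloaded here */
    }
}
```

Calling it with gain 32767.0f and limits -32768.0f and 32767.0f would give 16-bit-range integers without any integer-side clamping.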
If you can't avoid that, then the best alternative is to use AltiVec. The
vector registers can hold and operate on both floating point and integer
data. There are instructions for converting between the two. This should
save all that store/load traffic. You'll get a nice speed boost as well.
Unfortunately, this doesn't work on G3, which has no AltiVec unit. There
is no emulator. G3s just crash when they hit AltiVec code.
http://developer.apple.com/hardware/ve
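A minimal sketch of the AltiVec route, assuming 16-byte aligned buffers and a compiler that defines __ALTIVEC__ when the vector unit is enabled. The function name is invented for this example. It uses vec_ld, vec_cts (scale factor 0) and vec_st from altivec.h, and falls back to a scalar loop when AltiVec is not compiled in or the pointers are misaligned. It does not do any runtime check for a G3, so you would still have to keep this code away from machines without a vector unit. Note that vec_cts saturates out-of-range values, which a plain C cast does not promise.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Hypothetical helper: convert floats to ints, four at a time where possible. */
void convert_floats_to_ints_vec(const float *in, int *out, size_t count)
{
    size_t i = 0;
#if defined(__ALTIVEC__)
    /* vec_ld/vec_st ignore the low 4 address bits, so only take the
     * vector path when both buffers are 16-byte aligned. */
    if ((((uintptr_t)in | (uintptr_t)out) & 15) == 0) {
        for (; i + 4 <= count; i += 4) {
            vector float v = vec_ld(0, in + i);    /* load 4 floats */
            vector signed int r = vec_cts(v, 0);   /* truncate, saturating */
            vec_st(r, 0, out + i);                 /* store 4 ints */
        }
    }
#endif
    /* Scalar tail, and the whole job when AltiVec is unavailable. */
    for (; i < count; i++)
        out[i] = (int)in[i];
}
```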
Finally, you can simply work around the store to load latency by filling
up the pipelines with so much data that the pipeline bubbles go away. The
code then performs as if every step in the process (fctiwz, stfd, lwz)
executed in a single cycle, and the store to load latency effectively
disappears. The throughput will top out at about 1 float / 2 cycles, which
is probably close to an order of magnitude better than what your current
compiler-generated code does now. Try to unroll your loop so that you are
doing 8 conversions at a time in parallel. It is unlikely that doing the
easy thing in C is going to work, because the compiler is likely using the
same temporary stack double (for the stfd part) for all 8, forcing them to
operate in series and defeating your well laid plans. (Compilers ain't smart.)
So..... whatchaprollygoddado is manually unroll this stuff and use a
little inline assembly to make sure the compiler has no say in the
process. It would look something like this:
#if defined( __GNUC__ )
    #include <ppc_intrinsics.h>
#endif

//Unverified code
void ConvertFloatsToInts( float *indata, int *outdata, int count )
{
    int i;
    double f0, f1, f2, f3, f4, f5, f6, f7;

    //Note: if count is not divisible by 8, the leftover
    //samples at the end are not converted
    for( i = 0; i <= count - 8; i += 8 )
    {
        //load the floats
        f0 = indata[0];
        f1 = indata[1];
        f2 = indata[2];
        f3 = indata[3];
        f4 = indata[4];
        f5 = indata[5];
        f6 = indata[6];
        f7 = indata[7];

        //Convert to ints
        //integer data is still in the FP register file
        f0 = __fctiwz( f0 );
        f1 = __fctiwz( f1 );
        f2 = __fctiwz( f2 );
        f3 = __fctiwz( f3 );
        f4 = __fctiwz( f4 );
        f5 = __fctiwz( f5 );
        f6 = __fctiwz( f6 );
        f7 = __fctiwz( f7 );

        //store out the 32 bit low word of the double
        //(which contains the int)
        __stfiwx( f0, 0 * sizeof(int), outdata );
        __stfiwx( f1, 1 * sizeof(int), outdata );
        __stfiwx( f2, 2 * sizeof(int), outdata );
        __stfiwx( f3, 3 * sizeof(int), outdata );
        __stfiwx( f4, 4 * sizeof(int), outdata );
        __stfiwx( f5, 5 * sizeof(int), outdata );
        __stfiwx( f6, 6 * sizeof(int), outdata );
        __stfiwx( f7, 7 * sizeof(int), outdata );

        //If you needed to do more with those ints, you
        //could safely start loading them here and start
        //working on them

        outdata += 8;
        indata += 8;
    }
}
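The loop above only handles whole groups of 8. One way to deal with the leftovers, sketched here in portable C, is to run the unrolled step over the full blocks and then finish with a simple scalar loop. convert8 below is a hypothetical stand-in for the unrolled body (on PPC you would put the intrinsics version there), and both function names are invented for this example.

```c
#include <stddef.h>

/* Hypothetical stand-in for one unrolled 8-sample step. */
static void convert8(const float *in, int *out)
{
    int k;
    for (k = 0; k < 8; k++)
        out[k] = (int)in[k];   /* truncates toward zero, like fctiwz in range */
}

/* Convert any number of samples: full blocks of 8 first, then the tail. */
void convert_with_tail(const float *in, int *out, size_t count)
{
    size_t i = 0;

    /* i + 8 <= count avoids the unsigned wraparound that count - 8
     * would cause when count is smaller than 8. */
    for (; i + 8 <= count; i += 8)
        convert8(in + i, out + i);

    /* At most 7 samples remain; the slow path is fine for these. */
    for (; i < count; i++)
        out[i] = (int)in[i];
}
```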
Is that sufficient? Probably not. Some compilers see inline asms and
interpret that as an invitation to schedule your code in highly
questionable ways. On GCC, it may be necessary to replace the f0 and f7
inline assembly lines for fctiwz with the really ugly GCC asm syntax to
defeat the scheduler and keep the loads and stores away from the dependent
fctiwz instructions. Declaring an asm volatile discourages GCC from
rescheduling instructions past it, though the compiler does not strictly
guarantee that ordering, so check the generated assembly.
// f0 = __fctiwz( f0 ); becomes:
asm volatile( "fctiwz %0, %1" : "=f" (f0) : "f" (f0) );
You can also use -fno-schedule-insns and -fno-schedule-insns2, but this
will turn off instruction scheduling for the whole file, which is probably
not what you want.
Sadly I am sure none of this is what you want to hear as someone porting
an app, but dems are the breaks.
Somewhere Apple has some prerolled high performance float to int
conversion code available for you to convert arrays. I can't remember
where it is.
If you are going to be doing your own hand tuning of this function, SimG4
is your friend.
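When hand tuning, it also helps to have a boring reference version to compare results against. The sketch below is one possible reference, with invented function names, assuming a 32-bit int. It tries to mimic what I understand fctiwz to do at the edges: truncate toward zero, saturate out-of-range values, and turn NaN into 0x80000000. Check those edge cases against the PowerPC documentation before relying on them.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical reference conversion with fctiwz-like edge behavior
 * (assumes int is 32 bits). */
int ref_float_to_int(float x)
{
    if (x != x)              return INT_MIN;  /* NaN */
    if (x >= 2147483648.0f)  return INT_MAX;  /* too large: saturate */
    if (x <= -2147483648.0f) return INT_MIN;  /* too small: saturate */
    return (int)x;                            /* in range: truncate */
}

/* Count how many outputs of a tuned routine differ from the reference. */
size_t count_mismatches(const float *in, const int *got, size_t count)
{
    size_t i, bad = 0;
    for (i = 0; i < count; i++)
        if (got[i] != ref_float_to_int(in[i]))
            bad++;
    return bad;
}
```

Run your tuned routine over a buffer that includes large values, negative values and a NaN, then pass the input and output to count_mismatches and expect 0.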
Ian
* Exceptions: lvsl and lvsr, which build a vector register result from an
address held in integer registers.
---------------------------------------------------
Ian Ollmann, Ph.D. email@hidden
---------------------------------------------------