Compiler optimization doing strange things
- Subject: Compiler optimization doing strange things
- From: Steve Checkoway <email@hidden>
- Date: Sun, 14 May 2006 19:31:46 -0700
I'm seeing some very strange interactions between AltiVec code and
compiler optimizations that I'm hoping someone can explain to me. The
basic algorithm I'm writing is:
int32_t *dest;
const int16_t *src;
short volume;
for( size_t pos = 0; pos < size; ++pos )
    dest[pos] += src[pos] * volume;
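
For checking a vectorized version against known-good output, the scalar
loop above can be wrapped in a small self-contained function. This is
only a sketch: the name mix_scalar and its parameter order are
assumptions made for illustration and do not come from the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for dest[pos] += src[pos] * volume.
 * mix_scalar is a hypothetical name chosen for this sketch. */
static void mix_scalar(int32_t *dest, const int16_t *src,
                       size_t size, int16_t volume)
{
    for (size_t pos = 0; pos < size; ++pos) {
        /* Widen to 32 bits before multiplying so the product is not truncated. */
        dest[pos] += (int32_t)src[pos] * volume;
    }
}
```

Running this and the AltiVec version over the same buffers and comparing
dest element by element is one simple way to confirm the vector code
before looking at stalls.
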
dest is 16-byte aligned but src is not. I'm loading, permuting, and
splatting volume into vol, and I'm following the docs for dealing with
unaligned data. The important part of the code is:
vSInt16 MSQ = vec_ld( 0, src );
vSInt16 LSQ;
mask = vec_add( vec_lvsl(15, src), vec_splat_u8(1) );
while( size > 7 )
{
    // Load next 16 bytes.
    LSQ = vec_ld( 15, src );
    vSInt16 data = vec_perm( MSQ, LSQ, mask );
    vSInt32 result1 = vec_ld( 0, dest );
    vSInt32 result2 = vec_ld( 16, dest );
    vSInt32 even = vec_mule( data, vol );
    vSInt32 odd = vec_mulo( data, vol );
    vSInt32 first = vec_mergeh( even, odd );
    vSInt32 second = vec_mergel( even, odd );
    vec_st( vec_add(result1, first), 0, dest );
    vec_st( vec_add(result2, second), 16, dest );
    dest += 8;
    src += 8;
    size -= 8;
    MSQ = LSQ;
}
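
The multiply and merge steps in the loop above can be modeled in
portable C to see why the results land back in source order. This is a
scalar sketch, not AltiVec code: mix_block8_model is a hypothetical name,
and the arrays stand in for the vector registers under the usual
big-endian AltiVec element numbering (element 0 first).

```c
#include <stddef.h>
#include <stdint.h>

/* Portable scalar model of one loop iteration (8 samples).
 * mix_block8_model is a hypothetical name for this sketch. */
static void mix_block8_model(int32_t *dest, const int16_t *src, int16_t volume)
{
    int32_t even[4], odd[4], first[4], second[4];

    for (size_t i = 0; i < 4; ++i) {
        even[i] = (int32_t)src[2 * i] * volume;      /* like vec_mule: elements 0,2,4,6 */
        odd[i]  = (int32_t)src[2 * i + 1] * volume;  /* like vec_mulo: elements 1,3,5,7 */
    }

    /* Like vec_mergeh: interleave the first two elements of each input. */
    first[0] = even[0];  first[1] = odd[0];
    first[2] = even[1];  first[3] = odd[1];

    /* Like vec_mergel: interleave the last two elements of each input. */
    second[0] = even[2]; second[1] = odd[2];
    second[2] = even[3]; second[3] = odd[3];

    /* Accumulate into dest, matching the two vec_add/vec_st pairs. */
    for (size_t i = 0; i < 4; ++i) {
        dest[i]     += first[i];
        dest[i + 4] += second[i];
    }
}
```

In this model, first holds the products for samples 0 to 3 and second
holds those for samples 4 to 7, so the two stores write the eight
results in their original order.
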
I handle the remaining data in the same way, except that I'm careful to
store only up to the remaining size bytes. Originally, my while
loop's condition was while( size & ~0x7 ). With -O3 this produced a
single copy of the loop body, which had quite a few stalls on the G5
that it seemed further unrolling could remove. The same code
produces only one stall on the G4.
When I changed the condition to size > 7, the compiler appears to
have unrolled the loop 4 times, but it just copied the body verbatim
each time. In fact, from what I can tell, the number of stalls went
up on both the G4 and the G5.
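
For reference, the two loop conditions run the loop the same number of
times when size is an unsigned type. The sketch below assumes size is a
size_t (the post does not show its declaration), and the helper names
trips_mask and trips_compare are hypothetical.

```c
#include <stddef.h>

/* Count passes of a loop that consumes 8 elements per pass,
 * using the original mask-style condition. */
static size_t trips_mask(size_t size)
{
    size_t n = 0;
    while (size & ~(size_t)0x7) {  /* nonzero whenever any bit above bit 2 is set */
        size -= 8;
        ++n;
    }
    return n;
}

/* Same loop, using the comparison-style condition. */
static size_t trips_compare(size_t size)
{
    size_t n = 0;
    while (size > 7) {
        size -= 8;
        ++n;
    }
    return n;
}
```

Both functions return size / 8 for any unsigned input, so under that
assumption the difference in generated code would come from how the
compiler analyzes the condition rather than from the loop's behavior.
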
Three questions:
1. Why does the compiler not do loop unrolling when I use size & ~0x7?
2. When the compiler does unroll the loop, why does it not interleave
the independent instructions?
3. Is there perhaps a better way to write this algorithm to eliminate
these stalls?
One final point: all of this was while targeting the G4. If I target
the G5, the number of stalls on the G5 does not decrease in either
case (in the unrolled case, at least, it actually increases by one),
and the number of stalls on the G4 increases, probably because of the
different placement of the vector loads. Targeting the G4 separates
the loads; targeting the G5 places them next to each other (as would
be expected, sort of).
Thanks,
- Steve