Compiler optimization doing strange things

Subject: Compiler optimization doing strange things
From: Steve Checkoway <email@hidden>
Date: Sun, 14 May 2006 19:31:46 -0700

I'm seeing some very strange interactions between AltiVec and compiler optimizations that I'm hoping someone can explain to me. The basic algorithm I'm writing is:

int32_t *dest;
const int16_t *src;
short volume;
for( size_t pos = 0; pos < size; ++pos )
	dest[pos] += src[pos] * volume;
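
For reference, the fragment above can be wrapped into a self-contained scalar function. This is only a sketch: the name mix_scalar and the choice of int16_t for volume are assumptions made for illustration, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference (hypothetical name): dest[i] += src[i] * volume
 * for every i in [0, size). */
void mix_scalar(int32_t *dest, const int16_t *src, size_t size, int16_t volume)
{
    /* Null pointers are only acceptable when there is nothing to process. */
    assert(size == 0 || (dest != NULL && src != NULL));

    for (size_t pos = 0; pos < size; ++pos) {
        /* Both 16-bit operands are widened before multiplying, so the
         * product itself always fits in 32 bits. The accumulation into
         * dest can still overflow if dest already holds a large value. */
        dest[pos] += (int32_t)src[pos] * volume;
    }
}
```

A function like this is handy as a baseline for checking the output of any vectorized version element by element.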

dest is 16-byte aligned but src is not. I'm loading, permuting, and splatting volume into vol, and I'm following the docs for dealing with unaligned data. The important part of the code is:

	vSInt16 MSQ = vec_ld( 0, src );
	vSInt16 LSQ;

	mask = vec_add( vec_lvsl(15, src), vec_splat_u8(1) );

	while( size > 7 )
	{
		// Load next 16 bytes.
		LSQ = vec_ld( 15, src );

		vSInt16 data = vec_perm( MSQ, LSQ, mask );
		vSInt32 result1 = vec_ld( 0, dest );
		vSInt32 result2 = vec_ld( 16, dest );
		vSInt32 even = vec_mule( data, vol );
		vSInt32 odd = vec_mulo( data, vol );
		vSInt32 first = vec_mergeh( even, odd );
		vSInt32 second = vec_mergel( even, odd );

		vec_st( vec_add(result1, first), 0, dest );
		vec_st( vec_add(result2, second), 16, dest );
		dest += 8;
		src += 8;
		size -= 8;
		MSQ = LSQ;
	}
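
The even/odd multiply followed by the two merges is what puts the eight 32-bit products back into source order. The portable sketch below models that data flow on plain arrays so it can be checked on any machine. The model_* names are hypothetical stand-ins for vec_mule, vec_mulo, vec_mergeh, and vec_mergel, and they assume AltiVec's element numbering (element 0 first); they are not AltiVec intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of vec_mule: products of the even-numbered elements 0, 2, 4, 6. */
void model_mule(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)a[2 * i] * b[2 * i];
}

/* Model of vec_mulo: products of the odd-numbered elements 1, 3, 5, 7. */
void model_mulo(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Model of vec_mergeh: interleave the first halves of a and b. */
void model_mergeh(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    out[0] = a[0]; out[1] = b[0]; out[2] = a[1]; out[3] = b[1];
}

/* Model of vec_mergel: interleave the second halves of a and b. */
void model_mergel(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    out[0] = a[2]; out[1] = b[2]; out[2] = a[3]; out[3] = b[3];
}

/* One loop iteration's worth of products, written to out in source order. */
void model_products(const int16_t data[8], int16_t volume, int32_t out[8])
{
    int16_t vol[8];
    int32_t even[4], odd[4];

    assert(data != NULL && out != NULL);

    for (int i = 0; i < 8; ++i)
        vol[i] = volume;                  /* stands in for the splat */

    model_mule(data, vol, even);
    model_mulo(data, vol, odd);
    model_mergeh(even, odd, out);         /* products 0..3 */
    model_mergel(even, odd, out + 4);     /* products 4..7 */
}
```

In this model, out[i] equals data[i] * volume for every i, which matches what the loop adds to result1 and result2.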

I handle the remaining data in the same way, except that I'm careful to store only up to the remaining size bytes. Originally, my while loop's condition was while( size & ~0x7 ). With -O3 this produced a single copy of the loop body, which had quite a few stalls on the G5 that it seemed the compiler could remove by unrolling the loop further. The same code produces only one stall on the G4.
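
For an unsigned size, the two loop conditions should run the same number of iterations, so any difference in the generated code comes from how the compiler analyzes the loop rather than from its behavior. The sketch below checks that and shows the blocks-of-8-plus-tail structure in scalar form. It assumes size is a size_t as in the scalar loop; the names blocks_masked, blocks_compared, and mix_blocked are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 8-element blocks processed with the masked condition. */
size_t blocks_masked(size_t size)
{
    size_t n = 0;
    /* The cast makes the mask width explicit; plain ~0x7 converts to the
     * same value when combined with a size_t. */
    while (size & ~(size_t)0x7) {
        size -= 8;
        ++n;
    }
    return n;
}

/* Number of 8-element blocks processed with the comparison condition. */
size_t blocks_compared(size_t size)
{
    size_t n = 0;
    while (size > 7) {
        size -= 8;
        ++n;
    }
    return n;
}

/* Scalar stand-in for the vector loop: full blocks of 8, then a tail. */
void mix_blocked(int32_t *dest, const int16_t *src, size_t size, int16_t volume)
{
    assert(size == 0 || (dest != NULL && src != NULL));

    while (size > 7) {
        for (int i = 0; i < 8; ++i)
            dest[i] += (int32_t)src[i] * volume;
        dest += 8;
        src += 8;
        size -= 8;
    }

    /* Tail: fewer than 8 elements remain, so touch only those. */
    for (size_t i = 0; i < size; ++i)
        dest[i] += (int32_t)src[i] * volume;
}
```

Both counting functions return size / 8 for any size_t input, and mix_blocked writes exactly size elements of dest.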

When I changed the condition to size > 7, the compiler appears to have unrolled the loop four times, but it just copied the loop body verbatim. In fact, as far as I can tell, the number of stalls went up on both the G4 and the G5.

Three questions:

1. Why does the compiler not do loop unrolling when I use size & ~0x7?
2. When the compiler does unroll the loop, why does it not interleave the independent instructions?
3. Is there perhaps a better way to write this algorithm to eliminate these stalls?

One final point: all of this was while targeting the G4. If I target the G5, the number of stalls on the G5 does not decrease in either case (in the unrolled case, at least, it actually increases by one), and the number of stalls on the G4 increases, probably because the vector loads are placed differently. Targeting the G4 separates the loads; targeting the G5 places them next to each other (as would be expected, sort of).

Thanks,

- Steve
Xcode-users mailing list      (email@hidden)