Re: Should I "manualy unfold" LoadUnaligned" function?



On 1 Nov 2005, at 08:04, Rustam Muginov wrote:

Hello all.
I am using the LoadUnaligned function described here:
http://developer.apple.com/hardware/ve/alignment.html
The function is pretty simple:

...

When iterating through the data in 128-bit chunks, the third instruction in the function seems to be an invariant to me. It would always create the same permute mask when loading at 16-byte * n offsets.

Yep, as long as you're always incrementing the unaligned address location by 16 * n bytes.
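
To see why the mask is invariant, here is a minimal scalar sketch in portable C (not AltiVec code). It assumes the documented behaviour of vec_lvsl, where the mask bytes are sh, sh+1, ..., sh+15 and sh is the low four bits of the address. The names model_lvsl and same_mask_after_step are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vec_lvsl(0, p): the 16 mask bytes are sh, sh+1, ..., sh+15,
   where sh is the low four bits of the address. (Hypothetical helper.) */
static void model_lvsl(const void *p, uint8_t mask[16])
{
	unsigned sh = (unsigned)((uintptr_t)p & 15u);
	int k;

	for (k = 0; k < 16; ++k)
		mask[k] = (uint8_t)(sh + k);
}

/* Returns 1 if p and p + 16 * n bytes produce the same mask. */
static int same_mask_after_step(const unsigned char *p, int n)
{
	uint8_t a[16], b[16];

	assert(n >= 0);
	model_lvsl(p, a);
	model_lvsl(p + 16 * n, b);
	return memcmp(a, b, sizeof a) == 0;
}
```

Adding a multiple of 16 never changes the low four address bits, so the mask only needs to be computed once before the loop.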


Should I manually "unfold" this function in the loop, i.e. create the permute mask
mask = vec_lvsl(0, target);
before the loop starts, and do only three instructions inside the loop:
  MSQ = vec_ld(0, target);
  LSQ = vec_ld(15, target);
  return vec_perm(MSQ, LSQ, mask);

If you're working in a loop then you can get it down to one load and one permutation per loop iteration. For example:


void doWork(float *data, int dataLen)
{
	vector float u0 = vec_ld(0, data);
	vector unsigned char loadPerm = vec_lvsl(0, data);
	int end;
	int i;

	end = dataLen - (dataLen & 3);
	for (i = 0; i < end; i += 4) {
		float *p = data + i;
		vector float u1;
		vector float v;

		// each unaligned vector is constructed from two aligned vectors
		u1 = vec_ld(16, p);
		v = vec_perm(u0, u1, loadPerm);

		// update previous aligned vector
		u0 = u1;

		// do work
		// ...
	}

	// handle tail elements
	for (i = end; i < dataLen; ++i) {
		// do work
		// ...
	}
}

However, note that I use vec_ld(16, target) instead of vec_ld(15, target) like Apple recommends. This is because in some cases, the data that I want to load may actually be aligned and using vec_ld(15, target) would load the same location as vec_ld(0, target), therefore the u0 vector wouldn't be updated with the correct data.

If you can guarantee that your data *won't* be aligned then just use vec_ld(15, target) as Apple does. If your data might be aligned then use vec_ld(16, target), but be aware that at the end of the array this can read up to 16 bytes past the data you own. In that case, just allocate an extra 16 bytes at the end of the array. This is mentioned specifically in the documentation that you referenced.
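
The difference between the two offsets can be checked with a small address-arithmetic sketch in plain C. It assumes only that vec_ld ignores the low four bits of the effective address; model_ld_addr and second_load_advances are hypothetical names, and the addresses in the note below are arbitrary examples.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the address that vec_ld(offset, addr) really reads:
   the low four bits of addr + offset are ignored. (Hypothetical helper.) */
static uintptr_t model_ld_addr(uintptr_t offset, uintptr_t addr)
{
	assert(offset <= 16);	/* only the small offsets discussed here */
	return (addr + offset) & ~(uintptr_t)15;
}

/* Returns 1 if the second load lands on a different 16-byte block
   than vec_ld(0, addr) does. */
static int second_load_advances(uintptr_t offset, uintptr_t addr)
{
	return model_ld_addr(offset, addr) != model_ld_addr(0, addr);
}
```

For an aligned address such as 0x1000, offset 15 stays in the same block while offset 16 moves to 0x1010. For a misaligned address such as 0x1004, both offsets land on 0x1010.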

The above code can be similarly extended when unrolling the loop:

void doWork(float *data, int dataLen)
{
	vector float u0 = vec_ld(0, data);
	vector unsigned char loadPerm = vec_lvsl(0, data);
	int end;
	int i;

	end = dataLen - (dataLen & 15);
	for (i = 0; i < end; i += 16) {
		float *p = data + i;
		vector float u1, u2, u3, u4;
		vector float v1, v2, v3, v4;

		// each unaligned vector is constructed from two aligned vectors
		u1 = vec_ld(16, p);
		u2 = vec_ld(32, p);
		u3 = vec_ld(48, p);
		u4 = vec_ld(64, p);
		v1 = vec_perm(u0, u1, loadPerm);
		v2 = vec_perm(u1, u2, loadPerm);
		v3 = vec_perm(u2, u3, loadPerm);
		v4 = vec_perm(u3, u4, loadPerm);

		// update previous aligned vector
		u0 = u4;

		// do work
		// ...
	}

	// handle tail elements
	for (i = end; i < dataLen; ++i) {
		// do work
		// ...
	}
}
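
If you want to convince yourself that the rolling scheme (one new aligned load plus one permute per vector) returns the right bytes, a scalar model in portable C can be checked against a plain memcpy. This is only a sketch: memory is modelled as a byte array whose index 0 is treated as 16-byte aligned, the permute is restricted to lvsl-style masks, and model_vec, model_ld, model_perm and model_copy_unaligned are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } model_vec;	/* stand-in for a 128-bit register */

/* vec_ld model: read the 16-byte block that contains byte index addr. */
static model_vec model_ld(const uint8_t *mem, size_t addr)
{
	model_vec v;

	memcpy(v.b, mem + (addr & ~(size_t)15), 16);
	return v;
}

/* vec_perm model for lvsl-style masks only: take 16 consecutive bytes,
   starting at sh, from the concatenation of hi and lo. */
static model_vec model_perm(model_vec hi, model_vec lo, unsigned sh)
{
	model_vec v;
	unsigned k;

	assert(sh < 16);
	for (k = 0; k < 16; ++k) {
		unsigned idx = sh + k;
		v.b[k] = idx < 16 ? hi.b[idx] : lo.b[idx - 16];
	}
	return v;
}

/* Copies count 16-byte chunks starting at byte index start.
   mem must have 16 readable bytes past the last chunk (the padding
   discussed above). */
static void model_copy_unaligned(const uint8_t *mem, size_t start,
                                 size_t count, uint8_t *out)
{
	unsigned sh = (unsigned)(start & 15);	/* loop invariant, like loadPerm */
	model_vec u0 = model_ld(mem, start);
	size_t i;

	for (i = 0; i < count; ++i) {
		/* offset 16, so this also works when start is aligned */
		model_vec u1 = model_ld(mem, start + 16 * i + 16);
		model_vec v = model_perm(u0, u1, sh);

		memcpy(out + 16 * i, v.b, 16);
		u0 = u1;	/* update previous aligned vector */
	}
}
```

Both a misaligned start (say byte 5) and an aligned start (byte 16) should reproduce the source bytes exactly, which is the property the offset of 16 is meant to preserve.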

Or is it enough to declare this function as inline, so that the compiler removes the invariant from the loop itself?
I am interested in the behaviour of both gcc 3.3 and 4.0.

I don't think GCC is currently smart enough for this (since the memory location in question is a runtime variable) but I might be wrong. You should just compile the code and see what assembly GCC spits out. You can look at an assembly listing by right clicking on the code in Xcode (either in the file browser or the source code listing) and selecting "Show Assembly Code".


You can also do this from the command line using:
	gcc -S sourcecode.c
which will create a sourcecode.s file, or for compiled code:
	otool -tV filename
which will disassemble all the code in the executable.

I'm usually much too lazy to do any of those so I do a Shark profile of the code (make sure you've got debugging symbols turned on). Double click on the vec_lvsl instruction in the source listing in Shark and it will take you to its corresponding assembly code.

The people who made Shark should be given goddamn medals. Not just for that but for a whole bunch of other stuff that's in it.

I would like to maximize performance in the tight loop, but code readability is also an issue.
Thank you in advance.

If you work out a system for naming the variables and make sure you don't make any typos then this sort of code shouldn't be too bad. Otherwise, use preprocessor macros, e.g.:


#define LOAD(x, n) \
	x ## n = vec_ld(16 * (n), data)
#define PERM(x, a, b) \
	x ## b = vec_perm(u ## a, u ## b, loadPerm)
and use:
	LOAD(u, 1);
	LOAD(u, 2);
	LOAD(u, 3);
	PERM(v, 0, 1);
	PERM(v, 1, 2);
	PERM(v, 2, 3);
which should translate to:
	u1 = vec_ld(16, data);
	u2 = vec_ld(32, data);
	u3 = vec_ld(48, data);
	v1 = vec_perm(u0, u1, loadPerm);
	v2 = vec_perm(u1, u2, loadPerm);
	v3 = vec_perm(u2, u3, loadPerm);
etc.



r i c k
PerfOptimization-dev mailing list

Copyright © 2007 Apple Inc. All rights reserved.