Mailing Lists: Apple Mailing Lists


Re: Altivec: Extracting Floats From Vector Float



On Monday, April 5, 2004, at 07:04  AM, Roger Kylin wrote:

I did some playing around with vec_ste and found some surprising
results, and at least one question

If I have

	vector float f={1,2,3,4};
	float	f1,f2,f3,f4;

	vec_ste(f,0,&f1); printf("%f\n",f1);
	vec_ste(f,1,&f2); printf("%f\n",f2);
	vec_ste(f,2,&f3); printf("%f\n",f3);
	vec_ste(f,3,&f4); printf("%f\n",f4);

The output is:

1
2
3
4

But, if I do:

	vector float f={1,2,3,4};
	float	f1;

	vec_ste(f,0,&f1); printf("%f\n",f1);
	vec_ste(f,1,&f1); printf("%f\n",f1);
	vec_ste(f,2,&f1); printf("%f\n",f1);
	vec_ste(f,3,&f1); printf("%f\n",f1);

The output is:

1
1
1
1

Why are the subsequent calls to vec_ste overwriting the value f1?

I assume you mean why are they _not_ overwriting f1? Look again at the documentation for vec_ste(). Everything the vector unit does is aligned on 16-byte boundaries, and AltiVec will not, under any circumstances, allow you to break this general rule. The load/store element instructions simply load or store a single element at the position it would have occupied had you used vec_ld() or vec_st(), without touching the neighboring elements in memory. The second argument is a byte offset, so offsets 1, 2, and 3 added to &f1 all round back down to f1 itself, and the same element is stored every time.

Also, in your first code example, you got really, really lucky that f1, f2, f3, and f4 were laid out consecutively within one 16-byte block. The only way to make sure they are at the correct place is to do the following:


union {
	vector float v;
	float	s[4];
} swap;

This will guarantee that the floats s[0]..s[3] are correctly aligned. There is no other way to do it in C. (Of course, you could use GCC's __aligned__ attribute, but it's not exactly portable.)
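
For readers on a C11 compiler (a standard that postdates this thread), the `_Alignas` specifier is a portable alternative to the GCC attribute. The sketch below is only an illustration that runs without AltiVec: the `aligned_swap` type and both helper functions are hypothetical names, and the `_Alignas(16)` byte array stands in for the `vector float` member of the union above. It assumes a 4-byte float.

```c
#include <stdint.h>

/* Hypothetical stand-in for the union in the text: the _Alignas(16)
   member plays the role of the `vector float` member and forces the
   whole union, and therefore s[0], onto a 16-byte boundary. */
typedef union {
    _Alignas(16) unsigned char v[16];
    float s[4];
} aligned_swap;

/* Byte position of p within its enclosing 16-byte block. */
unsigned slot_offset(const void *p)
{
    return (unsigned)((uintptr_t)p & 15u);
}

/* Returns 1 if s[0]..s[3] sit at byte positions 0, 4, 8, 12 of one
   16-byte block (assumes sizeof(float) == 4). */
int check_aligned_swap(void)
{
    aligned_swap swap;
    return slot_offset(&swap.s[0]) == 0
        && slot_offset(&swap.s[1]) == 4
        && slot_offset(&swap.s[2]) == 8
        && slot_offset(&swap.s[3]) == 12;
}
```

Those four byte positions are exactly the slots that the store element instruction uses to decide which element of the vector gets written.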
Now, back to your question. Check out the results of this:


vector float example = (vector float)(1.0, 2.0, 3.0, 4.0);

vec_ste(example, 0, &swap.s[0]); printf("%f\n", swap.s[0]);
vec_ste(example, 4, &swap.s[0]); printf("%f\n", swap.s[1]);
vec_ste(example, 8, &swap.s[0]); printf("%f\n", swap.s[2]);
vec_ste(example, 12, &swap.s[0]); printf("%f\n", swap.s[3]);

The second argument is a byte offset, so 4, 8, and 12 step through s[1], s[2], and s[3]. This explains why you don't see swap.s[0] being overwritten with 2.0, 3.0, or 4.0: store element is like a restricted vec_st() that writes only the element belonging at the target address. The address is implicitly rounded down to a multiple of 4 to find the float, and that float's position within its 16-byte block selects the element. For more fun with this rounding, try byte offsets that are not multiples of 4 (5 or 6 behaves like 4), or replace &swap.s[0] with &swap.s[1] or &swap.s[2] while keeping the offset at 0.
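
The addressing rule can be modeled in plain C, which may help when experimenting on a machine without AltiVec. This is a rough sketch, not the real intrinsic: `model_vec_ste` and `demo_model_vec_ste` are hypothetical names, the vector is represented as an array of four floats, and a C11 compiler with 4-byte floats is assumed.

```c
#include <stdint.h>

/* Portable model (an assumption, not the real intrinsic) of
   vec_ste(vec, byte_offset, ptr) for a vector of four floats. */
void model_vec_ste(const float vec[4], long byte_offset, float *ptr)
{
    /* Effective address: pointer plus BYTE offset, rounded down to 4. */
    uintptr_t ea = ((uintptr_t)ptr + (uintptr_t)byte_offset) & ~(uintptr_t)3;
    /* The slot that address occupies in its 16-byte block picks the element. */
    unsigned idx = (unsigned)((ea >> 2) & 3u);
    *(float *)ea = vec[idx];
}

/* Hypothetical demo: returns 1 if the model behaves as described. */
int demo_model_vec_ste(void)
{
    static _Alignas(16) float s[4];
    const float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    int ok = 1;

    s[0] = s[1] = s[2] = s[3] = 0.0f;

    model_vec_ste(v, 1, &s[0]);      /* byte offset 1 still lands on s[0] */
    ok &= (s[0] == 1.0f && s[1] == 0.0f);

    model_vec_ste(v, 4, &s[0]);      /* byte offset 4 reaches s[1], element 1 */
    ok &= (s[1] == 2.0f);

    model_vec_ste(v, 0, &s[2]);      /* base &s[2]: element 2 goes to s[2] */
    ok &= (s[2] == 3.0f && s[0] == 1.0f);

    return ok;
}
```

In this model, storing through a single float with offsets 0 to 3 always writes the same element, which matches the 1, 1, 1, 1 output in the question.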

The surprising result

Old code (buried within some for and if loops):

vector float zero_vec={0.,0.,0.,0.};
vector float score_vec;
float ff[4];

if(vec_any_gt(score_vec,zero_vec)){
	GetFourFloats(score_vec,ff);
	for(i=0;i<4;i++){
		i1=(int)ff[i]-1;
		if(i1>=0){score[i1]=score[i1]+1;}
	}
}

, where GetFourFloats copies score_vec into ff via a union.

New code:

vector float zero_vec={0.,0.,0.,0.};
vector float score_vec;
float ff[4];

if(vec_any_gt(score_vec,zero_vec)){
	for(i=0;i<4;i++){
		vec_ste(score_vec,i,&ff[i]);
		i1=(int)ff[i]-1;
		if(i1>=0){score[i1]=score[i1]+1;}
	}
}


Running the code the old way took 56 seconds, the new way took 64 seconds... Any guesses why using vec_ste was slower?

The vec_ste() code is slower for the very good reason that you're not using it the way it was designed. The entire point of load/store element is to use the vector unit for things you would normally have to use the scalar unit for. Here you still use the scalar unit for the calculation, and because the result indexes into an array, there's no escaping the scalar unit, so there's really no point in using vec_ste() at all. However, there are a few optimizations you could still make:


vector float one_vec = (vector float)(1.0);
vector float score_vec;

/* Skip everything unless at least one element is >= 1.0. */
if(!vec_all_lt(score_vec, one_vec)) {
	union {
		vector signed int v;
		signed int s[4];
	} swap;

	/* Truncate first, then subtract 1, to match (int)ff[i]-1. */
	swap.v = vec_sub(vec_cts(score_vec, 0), vec_splat_s32(1));
	if(0 <= swap.s[0])
		score[swap.s[0]]++;
	if(0 <= swap.s[1])
		score[swap.s[1]]++;
	if(0 <= swap.s[2])
		score[swap.s[2]]++;
	if(0 <= swap.s[3])
		score[swap.s[3]]++;
}

This should work (please note that I typed it directly into my mail client so it may need some tweaking) and it should be slightly faster than what you had before.

Note that what you're doing, especially in a tight loop, will kill performance. If there is a way to save all the indexes that you've calculated in the vector code and then only update the score array once you're done doing vector calculations, that will be a lot cleaner and potentially much faster. Also note that you can actually index into small tables using the vector unit, but the tables have to be relatively small. Look at the permute instruction for that.
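
One possible shape for the "save the indexes, update later" idea is sketched below in plain scalar C. Everything here is hypothetical and only illustrative: the `pending_indexes` type, the `MAX_PENDING` capacity, and the function names are not from the original code, and the four ints passed in stand for the contents of the swap union after the vector conversion. As in the original loop, indexes are not checked against the size of the score array.

```c
#include <stddef.h>

#define MAX_PENDING 64   /* hypothetical buffer capacity */

typedef struct {
    int idx[MAX_PENDING];
    size_t count;
} pending_indexes;

/* Apply every queued increment in one scalar pass, then empty the queue. */
void flush_scores(pending_indexes *p, int *score)
{
    for (size_t i = 0; i < p->count; i++)
        score[p->idx[i]]++;
    p->count = 0;
}

/* Queue the non-negative indexes from one vector result.
   Flushes first if the buffer could not hold four more entries. */
void queue_indexes(pending_indexes *p, const int four[4], int *score)
{
    if (p->count + 4 > MAX_PENDING)
        flush_scores(p, score);
    for (int i = 0; i < 4; i++)
        if (four[i] >= 0)
            p->idx[p->count++] = four[i];
}

/* Hypothetical demo: returns 1 if deferred updates give the expected counts. */
int demo_deferred_update(void)
{
    pending_indexes pending = { {0}, 0 };
    int score[8] = {0};
    const int a[4] = {0, 2, -1, 2};
    const int b[4] = {7, -1, -1, 0};

    queue_indexes(&pending, a, score);
    queue_indexes(&pending, b, score);
    if (pending.count != 5 || score[2] != 0)
        return 0;                    /* nothing should be applied yet */

    flush_scores(&pending, score);
    return score[0] == 2 && score[2] == 2 && score[7] == 1
        && pending.count == 0;
}
```

The vector loop would call something like queue_indexes() per iteration and flush_scores() once at the end, keeping the scattered scalar writes out of the vector code.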

Brendan Younger
_______________________________________________
scitech mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/scitech
Do not post admin requests to the list. They will be ignored.



