RE: Altivec: Extracting Floats From Vector Float



Brendan,

Your suggestion helped a lot.  On my first test, it was about 25%
faster...

Roger


-----Original Message-----
From: email@hidden
[mailto:email@hidden] On Behalf Of Brendan Younger
Sent: Monday, April 05, 2004 10:32 PM
To: Roger Kylin
Cc: 'Apple SciTech'
Subject: Re: Altivec: Extracting Floats From Vector Float


On Monday, April 5, 2004, at 07:04  AM, Roger Kylin wrote:

> I did some playing around with vec_ste and found some surprising 
> results, and at least one question
>
> If I have
>
> 	vector float f={1,2,3,4};
> 	float	f1,f2,f3,f4;
>
> 	vec_ste(f,0,&f1); printf("%f\n",f1);
> 	vec_ste(f,1,&f2); printf("%f\n",f2);
> 	vec_ste(f,2,&f3); printf("%f\n",f3);
> 	vec_ste(f,3,&f4); printf("%f\n",f4);
>
> The output is:
>
> 1
> 2
> 3
> 4
>
> But, if I do:
>
> 	vector float f={1,2,3,4};
> 	float	f1;
>
> 	vec_ste(f,0,&f1); printf("%f\n",f1);
> 	vec_ste(f,1,&f1); printf("%f\n",f1);
> 	vec_ste(f,2,&f1); printf("%f\n",f1);
> 	vec_ste(f,3,&f1); printf("%f\n",f1);
>
> The output is:
>
> 1
> 1
> 1
> 1
>
> Why are the subsequent calls to vec_ste overwriting the value f1?

I assume you mean why are they _not_ overwriting f1?  Look again at the 
documentation for vec_ste().  Vector loads and stores work in terms of 
aligned 16-byte blocks, and AltiVec will not, under any circumstances, 
allow you to break this general rule.  What the load/store element 
instructions do is simply load/store one element at the same position 
it would have occupied had you used vec_ld() or vec_st(), but without 
touching the neighboring elements in memory.  Also, in your first code 
example, you got really, really lucky that f1, f2, f3, and f4 landed at 
the right positions within a 16-byte block.  A reliable way to make 
sure they are at the correct place is to do the following:

union {
	vector float v;
	float	s[4];
} swap;

This guarantees that the floats s[0]..s[3] are correctly aligned, 
because the union takes on the 16-byte alignment of its vector member.  
It is the most portable way to do it in C.  (You could also use GCC's 
__aligned__ attribute, but it's not exactly portable.)
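
For what it's worth, C11 (which postdates this thread) made alignment 
control standard through alignas in <stdalign.h>.  A minimal sketch, 
assuming a C11 compiler; aligned_floats and is_16_byte_aligned are names 
made up for illustration, and the struct only covers the scalar side (it 
does not involve the vector type at all):

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical type: four floats forced onto a 16-byte boundary,
 * standing in for the float half of the union above. */
typedef struct {
    alignas(16) float s[4];
} aligned_floats;

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Calling is_16_byte_aligned() on the s member of any aligned_floats 
object is a cheap sanity check to leave in debug builds.
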

Now, back to your question.  The second argument to vec_ste() is a byte 
offset, not an element index, so offsets of 1, 2, and 3 all truncate 
back to the same 4-byte word.  Check out the results of this:

vector float example = (vector float)(1.0, 2.0, 3.0, 4.0);

vec_ste(example, 0, &swap.s[0]); printf("%f\n", swap.s[0]);
vec_ste(example, 4, &swap.s[0]); printf("%f\n", swap.s[1]);
vec_ste(example, 8, &swap.s[0]); printf("%f\n", swap.s[2]);
vec_ste(example, 12, &swap.s[0]); printf("%f\n", swap.s[3]);

This explains why you don't see swap.s[0] being overwritten with 2.0, 
3.0, or 4.0: store element is like a restricted vec_st().  The 
effective address (pointer plus byte offset) is truncated to a multiple 
of 4, and the element written is the one that occupies that same 
position within the 16-byte block.  For more fun with this, try a zero 
offset with &swap.s[1] or &swap.s[2] in place of &swap.s[0] in the 
vec_ste() calls.
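
A scalar model of that addressing may help.  This is only a sketch in 
plain C, not AltiVec code: model_vec_ste is a made-up helper, and it 
assumes the documented behaviour that the pointer plus the byte offset 
is truncated to a multiple of 4, with the stored element picked by the 
address's position within its 16-byte block.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of vec_ste() on a vector of four floats.  Plain C, not
 * AltiVec; model_vec_ste is a hypothetical name for illustration. */
static void model_vec_ste(const float vec[4], long byte_offset, float *base)
{
    char *p = (char *)base + byte_offset;
    uintptr_t addr = (uintptr_t)p;

    p -= addr & 3u;                                /* truncate to a 4-byte boundary */
    unsigned lane = (unsigned)((addr & 15u) >> 2); /* slot within the 16-byte block */

    memcpy(p, &vec[lane], sizeof(float));          /* only this one float is written */
}
```

With a 16-byte-aligned s[4], offsets 1, 2, and 3 from &s[0] all land 
back on s[0] and store element 0, while offsets 4, 8, and 12 store 
elements 1, 2, and 3 into s[1], s[2], and s[3].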

> The surprising result
>
> Old code (buried within some for and if loops):
>
> vector float zero_vec={0.,0.,0.,0.};
> vector float score_vec;
> float ff[4];
>
> if(vec_any_gt(score_vec,zero_vec)){
> 	GetFourFloats(score_vec,ff);
> 	for(i=0;i<4;i++){
> 		i1=(int)ff[i]-1;
> 		if(i1>=0){score[i1]=score[i1]+1;}
> 	}
> }
>
> , where GetFourFloats copies score_vec into ff via a union.
>
> New code:
>
> vector float zero_vec={0.,0.,0.,0.};
> vector float score_vec;
> float ff[4];
>
> if(vec_any_gt(score_vec,zero_vec)){
> 	for(i=0;i<4;i++){
> 		vec_ste(score_vec,i,&ff[i]);
> 		i1=(int)ff[i]-1;
> 		if(i1>=0){score[i1]=score[i1]+1;}
> 	}
> }
>
>
> Running the code the old way took 56 seconds, the new way took 64 
> seconds... Any guesses why using vec_ste was slower?

The vec_ste() code is slower for the very good reason that you're not 
using it the way it was designed to be used.  The entire point of 
load/store element is to use the vector unit for things you would 
normally have to use the scalar unit for.  You are still using the 
scalar unit to perform the calculation, so there's really no point in 
using vec_ste() at all.  And since you're using the result to index 
into an array, there's no escape from the scalar units.  However, there 
are a few optimizations you could still make:

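vector float one_vec = (vector float)(1.0);
vector float score_vec;

if(!vec_all_lt(score_vec, one_vec)) {
	union {
		vector signed int v;
		signed int s[4];
	} swap;

	/* vec_cts() takes a scale argument (0 = no scaling) and truncates
	   toward zero, so convert first and subtract 1 afterwards to match
	   (int)ff[i]-1 for scores between 0 and 1. */
	swap.v = vec_sub(vec_cts(score_vec, 0), vec_splat_s32(1));
	if(0 <= swap.s[0])
		score[swap.s[0]]++;
	if(0 <= swap.s[1])
		score[swap.s[1]]++;
	if(0 <= swap.s[2])
		score[swap.s[2]]++;
	if(0 <= swap.s[3])
		score[swap.s[3]]++;
}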

This should work (please note that I typed it directly into my mail 
client so it may need some tweaking) and it should be slightly faster 
than what you had before.
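
When tweaking it, a plain scalar version is handy to check the vector 
results against.  The sketch below is not from the original code: 
bump_scores_ref and the score_len bounds check are assumptions added 
for illustration.  The detail to watch is that float-to-int conversion 
truncates toward zero, so the subtraction has to come after the 
conversion: 0.5 must give index -1 (skipped), whereas converting 
0.5 - 1.0 would give 0.

```c
#include <stddef.h>

/* Scalar reference for one group of four scores.  Hypothetical helper;
 * the upper-bound check is an addition the original code lacks. */
static void bump_scores_ref(const float vals[4], int *score, size_t score_len)
{
    for (int i = 0; i < 4; i++) {
        int idx = (int)vals[i] - 1;   /* truncate first, then subtract */
        if (idx >= 0 && (size_t)idx < score_len)
            score[idx]++;
    }
}
```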

Note that shuttling values from the vector unit to the scalar unit like 
this, especially in a tight loop, will kill performance.  If there is a 
way to save all the indexes that you've calculated in the vector code 
and then only update the score array once you're done doing vector 
calculations, that will be a lot cleaner and potentially much faster.  
Also note that you can actually index into tables using the vector 
unit, but the tables have to be relatively small.  Look at the permute 
instruction (vec_perm) for that.
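
To give a feel for the permute lookup, here is a scalar model of it.  
This is a sketch in plain C, not AltiVec code: model_vec_perm is a 
made-up name, and it assumes the usual vec_perm semantics, where the low 
5 bits of each selector byte pick one byte out of the 32-byte table 
formed by the two source vectors.

```c
/* Scalar model of vec_perm() used as a 32-entry byte table lookup.
 * a holds table entries 0..15, b holds entries 16..31 (hypothetical
 * helper for illustration only). */
static void model_vec_perm(const unsigned char a[16],
                           const unsigned char b[16],
                           const unsigned char sel[16],
                           unsigned char out[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned k = sel[i] & 0x1Fu;          /* only the low 5 bits count */
        out[i] = (k < 16) ? a[k] : b[k - 16]; /* index into a followed by b */
    }
}
```

All 16 lookups happen in a single instruction on the real hardware, 
which is why the table is limited to 32 bytes per permute.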

Brendan Younger
_______________________________________________
scitech mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/scitech
Do not post admin requests to the list. They will be ignored.