Brendan,
Your suggestion helped a lot. On my first test, it was about 25%
faster...
Roger
-----Original Message-----
From: email@hidden
[mailto:email@hidden] On Behalf Of Brendan Younger
Sent: Monday, April 05, 2004 10:32 PM
To: Roger Kylin
Cc: 'Apple SciTech'
Subject: Re: Altivec: Extracting Floats From Vector Float
On Monday, April 5, 2004, at 07:04 AM, Roger Kylin wrote:
> I did some playing around with vec_ste and found some surprising
> results, and at least one question
>
> If I have
>
> vector float f={1,2,3,4};
> float f1,f2,f3,f4;
>
> vec_ste(f,0,&f1); printf("%f\n",f1);
> vec_ste(f,1,&f2); printf("%f\n",f2);
> vec_ste(f,2,&f3); printf("%f\n",f3);
> vec_ste(f,3,&f4); printf("%f\n",f4);
>
> The output is:
>
> 1
> 2
> 3
> 4
>
> But, if I do:
>
> vector float f={1,2,3,4};
> float f1;
>
> vec_ste(f,0,&f1); printf("%f\n",f1);
> vec_ste(f,1,&f1); printf("%f\n",f1);
> vec_ste(f,2,&f1); printf("%f\n",f1);
> vec_ste(f,3,&f1); printf("%f\n",f1);
>
> The output is:
>
> 1
> 1
> 1
> 1
>
> Why are the subsequent calls to vec_ste overwriting the value f1?
I assume you mean why are they _not_ overwriting f1? Look again at the
documentation for vec_ste(). Everything the vector unit does is
aligned on 16 byte boundaries and AltiVec will not, under any
circumstances, allow you to break this general rule. What the
load/store element instructions do is simply load/store an element
at the same position it would have had if you had used
vec_ld() or vec_st(), but without touching the neighboring elements in
memory. Also, in your previous code example, you got really, really
lucky that f1,f2,f3,f4 happened to land at offsets 0, 4, 8 and 12
within a 16 byte block. The only way to make sure your floats are at
the correct place is to do the following:
union {
vector float v;
float s[4];
} swap;
This will guarantee that the floats s[0]..s[3] are correctly aligned.
There is no other way to do it in C. (Of course, you could use GCC's
__aligned__ attribute, but it's not exactly portable.)
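For anyone reading this on a compiler without the vector types, roughly the same alignment guarantee can be sketched with C11's _Alignas, which is newer than this thread. This is only an illustration, not a replacement for the union above; aligned_floats and is_16_byte_aligned are names made up for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for the union on a compiler without AltiVec:
   C11 _Alignas forces 16-byte alignment on a plain float array. */
typedef struct {
    _Alignas(16) float s[4];
} aligned_floats;

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Only s[0] sits on the 16 byte boundary; s[1] through s[3] follow at byte offsets 4, 8 and 12, which is exactly the layout vec_ste() depends on.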
Now, back to your question. Check out the results of this:
vector float example = (vector float)(1.0, 2.0, 3.0, 4.0);
vec_ste(example, 0, &swap.s[0]); printf("%f\n", swap.s[0]);
vec_ste(example, 4, &swap.s[0]); printf("%f\n", swap.s[1]);
vec_ste(example, 8, &swap.s[0]); printf("%f\n", swap.s[2]);
vec_ste(example, 12, &swap.s[0]); printf("%f\n", swap.s[3]);
The second argument is a byte offset, so 4, 8 and 12 land on s[1], s[2]
and s[3]. This explains why you don't see f1 being overwritten with
2.0, 3.0, or 4.0: your offsets of 1, 2 and 3 bytes all round down to
the same float, and the element that gets stored is the one whose
position in the vector matches the position of that address within its
16 byte block. Store element is like a restricted vec_st(). For more
fun with the implicit rounding of the address, try replacing
&swap.s[0] with &swap.s[1] or &swap.s[2], etc. in the vec_ste() calls.
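The address arithmetic can also be tried out in plain C without an AltiVec compiler. What follows is a rough scalar model, assuming a vector of four floats and treating the second argument as a byte offset; model_vec_ste is a made-up name, not an AltiVec call, and it only mimics which element ends up where.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of vec_ste() for four floats.
   It mimics only the address arithmetic, not the real instruction. */
static void model_vec_ste(const float vec[4], long byte_offset, float *base)
{
    /* effective address = base + offset, rounded down to a 4-byte boundary */
    uintptr_t ea = (uintptr_t)base + (uintptr_t)byte_offset;
    ea &= ~(uintptr_t)3;

    /* the slot of that address within its 16-byte block picks the element */
    unsigned element = (unsigned)((ea & 15u) / 4u);

    /* store just that one element, leaving its neighbors alone */
    memcpy((void *)ea, &vec[element], sizeof(float));
}
```

With a 16 byte aligned destination, byte offsets 0 through 3 all store element 0 into the first float, while offsets 4, 8 and 12 store elements 1, 2 and 3 into the floats that follow.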
> The surprising result
>
> Old code (buried within some for and if loops):
>
> vector float zero_vec={0.,0.,0.,0.};
> vector float score_vec;
> float ff[4];
>
> if(vec_any_gt(score_vec,zero_vec)){
> GetFourFloats(score_vec,ff);
> for(i=0;i<4;i++){
> i1=(int)ff[i]-1;
> if(i1>=0){score[i1]=score[i1]+1;}
> }
> }
>
> , where GetFourFloats copies score_vec into ff via a union.
>
> New code:
>
> vector float zero_vec={0.,0.,0.,0.};
> vector float score_vec;
> float ff[4];
>
> if(vec_any_gt(score_vec,zero_vec)){
> for(i=0;i<4;i++){
> vec_ste(score_vec,i,&ff[i]);
> i1=(int)ff[i]-1;
> if(i1>=0){score[i1]=score[i1]+1;}
> }
> }
>
>
> Running the code the old way took 56 seconds, the new way took 64
> seconds... Any guesses why using vec_ste was slower?
The vec_ste() code is slower for the very good reason that you're not
using it the way it was designed. The entire point of load/store
element is to use the vector unit for things you would normally have to
use the scalar unit for. Here you are still using the scalar unit to
perform a calculation, and since you're using the result to index into
an array, there's no escape from the scalar units, so there's really no
point in using vec_ste() at all. However, there are a few
optimizations you could still make:
vector float one_vec = (vector float)(1.0);
vector float score_vec;
/* skip the scalar work unless at least one element is >= 1.0 */
if(!vec_all_lt(score_vec, one_vec)) {
    union {
        vector signed int v;
        signed int s[4];
    } swap;
    /* convert first, then subtract 1, to match (int)ff[i]-1 */
    swap.v = vec_sub(vec_cts(score_vec, 0), vec_splat_s32(1));
    if(0 <= swap.s[0])
        score[swap.s[0]]++;
    if(0 <= swap.s[1])
        score[swap.s[1]]++;
    if(0 <= swap.s[2])
        score[swap.s[2]]++;
    if(0 <= swap.s[3])
        score[swap.s[3]]++;
}
This should work (please note that I typed it directly into my mail
client so it may need some tweaking) and it should be slightly faster
than what you had before.
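Since the snippet above is untested, a scalar reference to compare against may help with the tweaking. This is a plain C sketch of what the original loop computes for one group of four floats; update_scores_scalar is a made-up name, and it assumes every resulting index fits inside the score array.

```c
/* Scalar reference for one group of four lanes: truncate each float
   toward zero, subtract one, and bump that bin when the index is
   non-negative. The caller must ensure every index fits the table. */
static void update_scores_scalar(const float lanes[4], int *score)
{
    for (int i = 0; i < 4; i++) {
        int index = (int)lanes[i] - 1;
        if (index >= 0)
            score[index]++;
    }
}
```

Feeding the same four floats through both versions and comparing the score arrays is a quick way to check the vector code.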
Note that what you're doing, especially in a tight loop, will kill
performance. If there is a way to save all the indexes that you've
calculated in the vector code and then only update the score array once
you're done doing vector calculations, that will be a lot cleaner and
potentially much faster. Also note that you can actually index into
tables using the vector unit, but the tables have to be relatively
small. Look at the permute instruction, vec_perm(), for that.
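The idea of saving the indexes and updating the score array afterwards might look roughly like the two-pass sketch below. It is plain scalar C for illustration only: collect_indexes stands in for the part that would stay in vector code, and both function names are made up.

```c
#include <stddef.h>

/* Pass 1: record the bin index for each input value.
   A negative index means "no update" for that value. */
static void collect_indexes(const float *values, size_t count, int *indexes)
{
    for (size_t i = 0; i < count; i++)
        indexes[i] = (int)values[i] - 1;
}

/* Pass 2: apply all the saved updates in one scalar sweep. */
static void apply_indexes(const int *indexes, size_t count, int *score)
{
    for (size_t i = 0; i < count; i++)
        if (indexes[i] >= 0)
            score[indexes[i]]++;
}
```

Keeping the two passes apart means the vector loop never has to stop for a scalar table update, at the cost of a temporary buffer for the indexes.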
Brendan Younger
_______________________________________________
scitech mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/scitech
Do not post admin requests to the list. They will be ignored.