Hi friends,
I have a program doing some statistical computation.
Shark tells me the hot spot is a loop calculating
Mahalanobis distances among Gaussian mixtures.
The original loop computed the distance in
each Gaussian dimension and then summed them all up.
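As a rough illustration of that kind of per-dimension loop, here is a
minimal scalar sketch. It is not the actual code: the function name
mahalanobis_diag and its parameters are hypothetical, and it assumes a
diagonal covariance stored as per-dimension weights (for example,
inverse variances) that multiply the squared difference.

```c
#include <stddef.h>

/* Hypothetical scalar reference, not the original program's code.
   Returns the sum over dimensions of weight[i] * (mean[i] - x[i])^2,
   i.e. a squared Mahalanobis distance under a diagonal covariance,
   assuming weight[i] holds the inverse variance of dimension i. */
float mahalanobis_diag(const float *x, const float *mean,
                       const float *weight, size_t n)
{
    float sum = 0.0f;
    size_t i;
    for (i = 0; i < n; ++i) {
        float d = mean[i] - x[i];  /* per-dimension difference */
        sum += weight[i] * d * d;  /* weighted square, accumulated */
    }
    return sum;
}
```

A vectorized version processes 4 of these dimensions per step and keeps
4 partial sums, so its result should be compared against a scalar loop
like this only after the 4 lanes are added together.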
So I unrolled the loop and vectorized it with AltiVec as follows,
to calculate 4 dimensions at once:
vector float pt0, rv0, cv0, sa0, di0;
vector float pt1, rv1, cv1, sa1, di1;
vector float diS;
pt0 = vec_ld(16*0, pattern);
pt1 = vec_ld(16*1, pattern);
rv0 = vec_ld((16*0), rvP);
sa0 = vec_sub(rv0, pt0);
cv0 = vec_ld((16*0), cvP);
di0 = vec_madd(sa0, sa0, zero);
diS = vec_madd(cv0, di0, zero);
rv1 = vec_ld((16*1), rvP);
sa1 = vec_sub(rv1, pt1);
cv1 = vec_ld((16*1), cvP);
di1 = vec_madd(sa1, sa1, zero);
diS = vec_madd(cv1, di1, diS);
...
...
However, this gives no improvement.
When I look at the assembly in Shark,
the unrolled part is full of stalls.
I suspect one reason for the stalls is
vector register 'congestion'.
The assembly repeatedly uses v0, v1, v10, v11, v12, v13,
and another 11 vector registers are occupied by
the outer loop's pt0~pta (44 dimensions, 11 vector registers).
Since there are 32 vector registers in a G4,
there are about 32-(11+6) = 15 idle vector registers,
which could have been used in the computation to reduce stalls.
If this analysis is correct,
is there some way to tell gcc to
use more vector registers?
The compiler is the gcc 3.3 that comes with Xcode 1.5.
Or could you suggest some other way to do this efficiently?
Thanks in advance! :-)
Stan.