When I look at the assembly in Shark,
the unrolled part is full of stalls.
I suspect one reason for the stalls is
vector register 'congestion'.
The assembly repeatedly uses only v0, v1, v10, v11, v12, and v13,
and another 11 vector registers are occupied by
the outer-loop values pt0~pta (44 dimensions, 11 vector registers).
Since the G4 has 32 vector registers,
that leaves about 32-(11+6) = 15 idle vector registers,
which could have been used in the computation to reduce stalls.
Looks like a case of bad scheduling by the compiler. It might help to
write the unrolled loop body like this:
Stupid stuff to check:
1) Optimizer is turned on. Xcode/ProjectBuilder have multiple places
where a -O0 might sneak into your otherwise -O3 build.
2) There is no aliasing going on. Typically the compiler will unroll
and interleave correctly. The only times I've seen it fail to do so are
when you have code that looks like this:
load data
do calculation
store data
load data
do calculation
store data
load data
do calculation
store data
load data
do calculation
store data
In such cases, the compiler may not be smart enough to figure out
whether the stores and loads overlap, so it is forced to do the stores and
loads in exactly the order you wrote. Everything else has to follow
along. Simply breaking the load/store order should fix that:
load data
do calculation
load data
do calculation
load data
do calculation
load data
do calculation
store data
store data
store data
store data
though it can be more readable if you do it the way Holger suggested.
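As a rough sketch of the reordering described above, here are the two shapes side by side in plain C. The kernel (scale and add a bias), the function names, and the use of scalar floats in place of AltiVec vectors are all assumptions made for illustration; they are not taken from the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: dst[i] = src[i] * scale + bias, unrolled by four.
 * Each statement is a load, a calculation, and a store. Because dst and
 * src might alias, the compiler has to assume each store can change the
 * value read by the next load. */
void scale_bias_interleaved(float *dst, const float *src, size_t n,
                            float scale, float bias)
{
    assert(n % 4 == 0);            /* sketch only: no remainder loop */
    for (size_t i = 0; i < n; i += 4) {
        dst[i + 0] = src[i + 0] * scale + bias;
        dst[i + 1] = src[i + 1] * scale + bias;
        dst[i + 2] = src[i + 2] * scale + bias;
        dst[i + 3] = src[i + 3] * scale + bias;
    }
}

/* Same arithmetic, but within each group of four every load is written
 * before the first store, so the loads and calculations carry no
 * possible dependence on the stores of that group. Results match the
 * version above as long as dst and src do not partially overlap. */
void scale_bias_grouped(float *dst, const float *src, size_t n,
                        float scale, float bias)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* load data and do calculation, four times */
        float a = src[i + 0] * scale + bias;
        float b = src[i + 1] * scale + bias;
        float c = src[i + 2] * scale + bias;
        float d = src[i + 3] * scale + bias;

        /* store data, four times */
        dst[i + 0] = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
}
```

Whether the second form actually removes the stalls depends on the compiler and has to be checked in the generated assembly.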
Note that reordering non-dependent lines of C code on an optimizing
compiler typically has no significant effect -- the compiler is going
to reorder them the way it wants to anyway. One just has to be able to
predict what the compiler will think is dependent and what is not, and
order the dependent stuff the way you think will lead to the best speed
improvement.
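Another way to influence what the compiler treats as dependent, not mentioned above and offered only as a sketch, is the C99 restrict qualifier. It is a promise from the programmer that the pointers do not alias, which leaves the compiler free to move loads past stores. The function below is hypothetical and uses scalar floats in place of vectors; it assumes a compiler with C99 support.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: dst[i] = src[i] * scale + bias, unrolled by four.
 * restrict promises that dst and src never refer to the same memory, so
 * the load/calculate/store statements may be written in any order and the
 * compiler is still allowed to interleave them. Calling this with
 * overlapping arrays is undefined behavior. */
void scale_bias_restrict(float *restrict dst, const float *restrict src,
                         size_t n, float scale, float bias)
{
    assert(n % 4 == 0);            /* sketch only: no remainder loop */
    for (size_t i = 0; i < n; i += 4) {
        dst[i + 0] = src[i + 0] * scale + bias;
        dst[i + 1] = src[i + 1] * scale + bias;
        dst[i + 2] = src[i + 2] * scale + bias;
        dst[i + 3] = src[i + 3] * scale + bias;
    }
}
```

The qualifier only grants permission; it is still worth checking the assembly to see whether the compiler took advantage of it.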
All bets are off with GCC-2.95. That compiler would sometimes spontaneously
decide that it only had 8 or 24 vector registers and spill accordingly.
Get a newer compiler.
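A quick way to check both the compiler version and the stray -O0 from item 1 is to have the build report on itself. The sketch below relies on the __OPTIMIZE__ and __VERSION__ macros that GCC documents as predefined; other compilers may not define them, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if this file was compiled with optimization enabled.
 * GCC defines __OPTIMIZE__ at -O1 and above, and leaves it undefined
 * at -O0. */
int optimizer_enabled(void)
{
#ifdef __OPTIMIZE__
    return 1;
#else
    return 0;
#endif
}

/* Writes a one-line build summary into buf and returns the snprintf
 * result. Falls back to "unknown" if the compiler does not provide
 * __VERSION__. */
int describe_build(char *buf, size_t len)
{
#ifdef __VERSION__
    const char *version = __VERSION__;
#else
    const char *version = "unknown";
#endif
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "compiler: %s, optimizer: %s",
                    version, optimizer_enabled() ? "on" : "off");
}
```

Printing the summary from the same source file that holds the hot loop shows the flags that file was really built with, since the macros are evaluated per translation unit.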