Re: loop unrolling and AltiVec register utilization




On Oct 29, 2004, at 6:54 AM, Holger Bettag wrote:


On Fri, 29 Oct 2004, Stan Jou wrote:

When I look at the assembly in Shark,
the unrolled part is full of stalls.
I suspect one of the reasons for the stalls is
vector register 'congestion'.
The assembly repeatedly uses v0, v1, v10, v11, v12, v13,
and another 11 vector registers are occupied by
an outer loop, pt0~pta (44 dimensions, 11 vector registers).
Since there are 32 vector registers in a G4,
there are about 32-(11+6) = 15 idle vector registers,
which could have been used in the computation to reduce stalls.

Looks like a case of bad scheduling by the compiler. It might help to write the unrolled loop body like this:

Stupid stuff to check:

1) The optimizer is turned on. Xcode/ProjectBuilder have multiple places where a -O0 might sneak into your otherwise -O3 build.
2) There is no aliasing going on. Typically the compiler will unroll and interleave correctly. The only time I've seen it fail to do so is when the code looks like this:


load data
do calculation
store data
load data
do calculation
store data
load data
do calculation
store data
load data
do calculation
store data

In such cases, the compiler may not be smart enough to figure out whether the stores and loads overlap, so it is forced to do the stores and loads in exactly the order you wrote. Everything else has to follow along. Simply breaking the load/store order should fix that:

load data
do calculation
load data
do calculation
load data
do calculation
load data
do calculation
store data
store data
store data
store data

though it can be more readable if you do it the way Holger suggested. Note that reordering non-dependent lines of C code typically has no significant effect with an optimizing compiler -- the compiler is going to reorder them the way it wants to anyway. You just have to be able to predict what the compiler will think is dependent and what is not, and order the dependent stuff the way you think will lead to the best speed improvement.
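As a rough illustration of the two orderings above, here is a minimal sketch in plain C. It is not code from the original poster: scalar floats stand in for AltiVec vector loads and stores, and the function names (scale_interleaved, scale_grouped, variants_agree) and the multiply-by-k calculation are hypothetical, chosen only to show the shape of the loop body.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: out[i] = in[i] * k, unrolled four times.
   Every store sits between two loads, so a compiler that cannot prove
   that in and out do not overlap has to keep this memory order. */
void scale_interleaved(const float *in, float *out, float k, size_t n)
{
    size_t i;
    assert(n % 4 == 0);               /* sketch handles whole groups of 4 only */
    for (i = 0; i < n; i += 4) {
        float a = in[i];     out[i]     = a * k;
        float b = in[i + 1]; out[i + 1] = b * k;
        float c = in[i + 2]; out[i + 2] = c * k;
        float d = in[i + 3]; out[i + 3] = d * k;
    }
}

/* Same work, but all four loads and calculations come first and the
   four stores come last, so no store separates two loads within one
   iteration and the calculations are free to be interleaved. */
void scale_grouped(const float *in, float *out, float k, size_t n)
{
    size_t i;
    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        float a = in[i]     * k;
        float b = in[i + 1] * k;
        float c = in[i + 2] * k;
        float d = in[i + 3] * k;
        out[i]     = a;
        out[i + 1] = b;
        out[i + 2] = c;
        out[i + 3] = d;
    }
}

/* Returns 1 if both orderings give identical results on
   non-overlapping buffers, 0 otherwise. */
int variants_agree(void)
{
    float in[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    float x[8], y[8];
    size_t i;
    scale_interleaved(in, x, 2.0f, 8);
    scale_grouped(in, y, 2.0f, 8);
    for (i = 0; i < 8; i++) {
        if (x[i] != y[i]) return 0;
    }
    return 1;
}
```

The two functions only agree when in and out do not overlap. If out pointed into the middle of in, the interleaved version would read values that earlier stores had already changed, which is exactly why the compiler cannot make this transformation for you unless it can rule out aliasing.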

All bets are off with GCC-2.95. That compiler would sometimes spontaneously decide that it only had 8 or 24 vector registers and spill accordingly. Get a newer compiler.

Ian



References: 
 >loop unrolling and AltiVec register utilization (From: Stan Jou <email@hidden>)
 >Re: loop unrolling and AltiVec register utilization (From: Holger Bettag <email@hidden>)