Mailing Lists: Apple Mailing Lists


Re: loop unrolling and AltiVec register utilization



Stan Jou wrote:
The original loop was to compute the distance on
every gaussian dimension then sum up all of them.
So I unroll the loop and make it AltiVec as follows
to calculate 4 dimensions at once:

        vector float pt0, rv0, cv0, sa0, di0;
        vector float pt1, rv1, cv1, sa1, di1;
        vector float diS;

        pt0 = vec_ld(16*0, pattern);
        pt1 = vec_ld(16*1, pattern);

        rv0 = vec_ld((16*0), rvP);
        sa0 = vec_sub(rv0, pt0);
        cv0 = vec_ld((16*0), cvP);
        di0 = vec_madd(sa0, sa0, zero);
        diS = vec_madd(cv0, di0, zero);

        rv1 = vec_ld((16*1), rvP);
        sa1 = vec_sub(rv1, pt1);
        cv1 = vec_ld((16*1), cvP);
        di1 = vec_madd(sa1, sa1, zero);
        diS = vec_madd(cv1, di1, diS);
        ...
        ...

However, there is no improvement by this.

I'm probably not the best person to answer this, as I'm still learning myself, but I'll give it a go and other people can correct me if I've got my ideas wrong.

First, I believe it helps gcc assign registers if you use the register keyword, i.e. "register vector float pt0, rv0, cv0.....". That said, looking at the code, it's possible to write it using fewer variables. There's no reason rv0 can't be stored in the same location as sa0 and di0. Similarly, cv0 and pt0 could share a location. That makes 5 registers for the whole loop as you've written it:
  1. rv0, sa0, di0
  2. rv1, sa1, di1
  3. pt0, cv0
  4. pt1, cv1
  5. diS
You said the compiler was using 6 registers. That sounds reasonable to me. It doesn't sound like register congestion; it sounds like the compiler optimised the register usage.

Next, the stalls you're getting are probably because, although you can issue one AltiVec instruction per clock, it takes several cycles for the result to become available for use. I think this is normally about four cycles, so it's worth trying to make sure that you do useful work in that time.

        pt0 = vec_ld(16*0, pattern);
        pt1 = vec_ld(16*1, pattern);

        rv0 = vec_ld((16*0), rvP);
        sa0 = vec_sub(rv0, pt0); 	// rv0 loaded last cycle   - 3 Cycle Stall
        cv0 = vec_ld((16*0), cvP);
        di0 = vec_madd(sa0, sa0, zero);	// sa0 loaded 2 cycles ago - 2 Cycle Stall
        diS = vec_madd(cv0, di0, zero);	// di0 loaded last cycle   - 3 Cycle Stall

        rv1 = vec_ld((16*1), rvP);	
        sa1 = vec_sub(rv1, pt1);	// rv1 loaded last cycle   - 3 Cycle Stall
        cv1 = vec_ld((16*1), cvP);
        di1 = vec_madd(sa1, sa1, zero);	// sa1 loaded 2 cycles ago - 2 Cycle Stall
        diS = vec_madd(cv1, di1, diS);	// di1 loaded last cycle   - 3 Cycle Stall

A total of 16 stalled cycles and 12 cycles of useful work. Pretty inefficient. It would be better to rearrange this like so:
        pt0 = vec_ld(16*0, pattern);
        pt1 = vec_ld(16*1, pattern);

        rv0 = vec_ld((16*0), rvP);
        rv1 = vec_ld((16*1), rvP);	
        cv0 = vec_ld((16*0), cvP);
        cv1 = vec_ld((16*1), cvP);

        sa0 = vec_sub(rv0, pt0); 	// rv0 loaded 4 cycles ago - No Stall
        sa1 = vec_sub(rv1, pt1);	// rv1 loaded 4 cycles ago - No Stall
        di0 = vec_madd(sa0, sa0, zero);	// sa0 loaded 2 cycles ago - 2 Cycle Stall
        di1 = vec_madd(sa1, sa1, zero);	// sa1 loaded 2 cycles ago - 2 Cycle Stall
        diS = vec_madd(cv0, di0, zero);	// di0 loaded 2 cycles ago - 2 Cycle Stall
        diS = vec_madd(cv1, di1, diS);	// diS loaded last cycle   - 3 Cycle Stall

There we're stalled for just 9 cycles. Still not perfect, but better. You might be able to put some loads in between some of those madds to preload data for the next time round the loop, and so use those cycles. The alternative is to unroll the loop some more and interleave more operations. For example:
        pt0 = vec_ld(16*0, pattern);
        pt1 = vec_ld(16*1, pattern);
        pt2 = vec_ld(16*2, pattern);
        pt3 = vec_ld(16*3, pattern);

        rv0 = vec_ld((16*0), rvP);
        rv1 = vec_ld((16*1), rvP);	
        rv2 = vec_ld((16*2), rvP);
        rv3 = vec_ld((16*3), rvP);	

        sa0 = vec_sub(rv0, pt0); 	// rv0 loaded 4 cycles ago - No Stall
        sa1 = vec_sub(rv1, pt1);	// rv1 loaded 4 cycles ago - No Stall
        sa2 = vec_sub(rv2, pt2); 	// rv2 loaded 4 cycles ago - No Stall
        sa3 = vec_sub(rv3, pt3);	// rv3 loaded 4 cycles ago - No Stall

        di0 = vec_madd(sa0, sa0, zero);	// sa0 loaded 4 cycles ago - No Stall
        di1 = vec_madd(sa1, sa1, zero);	// sa1 loaded 4 cycles ago - No Stall
        di2 = vec_madd(sa2, sa2, zero);	// sa2 loaded 4 cycles ago - No Stall
        di3 = vec_madd(sa3, sa3, zero);	// sa3 loaded 4 cycles ago - No Stall

        cv0 = vec_ld((16*0), cvP);	// Moved down here to reduce register usage
        cv1 = vec_ld((16*1), cvP);
        cv2 = vec_ld((16*2), cvP);
        cv3 = vec_ld((16*3), cvP);

        diS = vec_madd(cv0, di0, zero);	// cv0 loaded 4 cycles ago - No Stall
        diS = vec_madd(cv1, di1, diS);	// diS loaded last cycle   - 3 Cycle Stall
        diS = vec_madd(cv2, di2, diS);	// diS loaded last cycle   - 3 Cycle Stall
        diS = vec_madd(cv3, di3, diS);	// diS loaded last cycle   - 3 Cycle Stall
Still 9 stall cycles, but with 24 useful cycles, effectively halving the cost of stalls. All using just 7 registers.
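
The stall counts quoted for the three orderings can be reproduced with a small model. What follows is only a sketch of the simplified counting used in the annotated listings: it assumes a 4-cycle result latency and one instruction issued per cycle, and it charges each stall independently (it ignores the fact that a stall also delays everything after it). The type Instr, the LOAD/OP macros and the function names are made up for illustration; none of this is AltiVec code or a real pipeline simulation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed result latency in cycles; the message estimates "about four". */
#define LATENCY 4

/* One instruction in issue order. deps[] holds the indices of earlier
 * instructions whose results it reads; unused slots are -1. */
typedef struct { int deps[3]; } Instr;

#define LOAD          {{-1, -1, -1}}
#define OP1(a)        {{(a), -1, -1}}
#define OP2(a, b)     {{(a), (b), -1}}
#define OP3(a, b, c)  {{(a), (b), (c)}}

/* For each instruction, find its most recent producer and charge
 * LATENCY minus the distance in issue slots (never less than zero). */
int count_stalls(const Instr *prog, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int latest = -1;
        for (int d = 0; d < 3; d++) {
            int p = prog[i].deps[d];
            assert(p < (int)i);              /* producers must come earlier */
            if (p > latest) latest = p;
        }
        if (latest >= 0) {
            int stall = LATENCY - ((int)i - latest);
            if (stall > 0) total += stall;
        }
    }
    return total;
}

/* Original order: pt0 pt1 | rv0 sa0 cv0 di0 diS | rv1 sa1 cv1 di1 diS */
static const Instr ORIGINAL[] = {
    LOAD, LOAD,
    LOAD, OP2(2, 0), LOAD, OP1(3), OP2(4, 5),
    LOAD, OP2(7, 1), LOAD, OP1(8), OP3(9, 10, 6)
};

/* Reordered: pt0 pt1 | rv0 rv1 cv0 cv1 | sa0 sa1 di0 di1 diS diS */
static const Instr REORDERED[] = {
    LOAD, LOAD,
    LOAD, LOAD, LOAD, LOAD,
    OP2(2, 0), OP2(3, 1), OP1(6), OP1(7), OP2(4, 8), OP3(5, 9, 10)
};

/* Unrolled by four: pt0-3 | rv0-3 | sa0-3 | di0-3 | cv0-3 | diS x4 */
static const Instr UNROLLED4[] = {
    LOAD, LOAD, LOAD, LOAD,
    LOAD, LOAD, LOAD, LOAD,
    OP2(4, 0), OP2(5, 1), OP2(6, 2), OP2(7, 3),
    OP1(8), OP1(9), OP1(10), OP1(11),
    LOAD, LOAD, LOAD, LOAD,
    OP2(16, 12), OP3(17, 13, 20), OP3(18, 14, 21), OP3(19, 15, 22)
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

int stalls_original(void)  { return count_stalls(ORIGINAL,  COUNT(ORIGINAL)); }
int stalls_reordered(void) { return count_stalls(REORDERED, COUNT(REORDERED)); }
int stalls_unrolled4(void) { return count_stalls(UNROLLED4, COUNT(UNROLLED4)); }
```

Under this model the three functions return 16, 9 and 9, matching the figures above. Real hardware will differ, since latencies vary by instruction and by processor, so treat it as a way to compare orderings rather than to predict cycle counts.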
I'm not sure how to do that sum at the end nicely though. Maybe somebody else has a suggestion.
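
If "that sum" means the chain of madds that all wait on diS, one common approach (not something tested here, and not from the original code) is to keep more than one running total so that consecutive madds do not depend on each other, and only combine the totals once after the loop. The sketch below shows the idea in portable C. The vec4 struct and the helpers madd4, sub4 and add4 are hypothetical stand-ins for vector float, vec_madd, vec_sub and vec_add so that it compiles anywhere, and the function name and argument layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Portable stand-in for AltiVec's `vector float` (four float lanes). */
typedef struct { float f[4]; } vec4;

/* Lane-wise a*b + c, approximating what vec_madd computes. */
static vec4 madd4(vec4 a, vec4 b, vec4 c)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] * b.f[i] + c.f[i];
    return r;
}

/* Lane-wise a - b, standing in for vec_sub. */
static vec4 sub4(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] - b.f[i];
    return r;
}

/* Lane-wise a + b, standing in for vec_add. */
static vec4 add4(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* Sum of cv * (rv - pattern)^2 over n vectors, using two independent
 * accumulators. n must be even. */
float dist_two_acc(const vec4 *pattern, const vec4 *rv, const vec4 *cv,
                   size_t n)
{
    assert(n % 2 == 0);
    const vec4 zero = {{0.0f, 0.0f, 0.0f, 0.0f}};
    vec4 acc0 = zero, acc1 = zero;

    for (size_t i = 0; i < n; i += 2) {
        vec4 sa0 = sub4(rv[i],     pattern[i]);
        vec4 sa1 = sub4(rv[i + 1], pattern[i + 1]);
        vec4 di0 = madd4(sa0, sa0, zero);
        vec4 di1 = madd4(sa1, sa1, zero);
        /* acc0 and acc1 never read each other, so these two madds
         * are not chained the way the diS updates are. */
        acc0 = madd4(cv[i],     di0, acc0);
        acc1 = madd4(cv[i + 1], di1, acc1);
    }

    /* Combine once, after the loop, then add up the four lanes. */
    vec4 s = add4(acc0, acc1);
    return (s.f[0] + s.f[1]) + (s.f[2] + s.f[3]);
}
```

Each accumulator is still only updated every other madd, so with a latency of around four cycles you would probably want four accumulators rather than two; two are used here just to keep the sketch short. Note also that splitting the sum changes the order of the floating-point additions, so results can differ slightly from the single-accumulator version.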

Shark tells you the instruction latency (how long it takes to get results back from an instruction) when you look at the disassembly. The x:y column gives this to you. One of the values is how many cycles before the next instruction is issued (the smaller number), the other is how many cycles before the result is available. I can't remember which is which though. If you're using Shark 4.0, the mixed source/disassembly view is quite good for this as it lets you match source lines to instructions.

Paul
PerfOptimization-dev mailing list      (email@hidden)

References: 
 >loop unrolling and AltiVec register utilization (From: Stan Jou <email@hidden>)




Copyright © 2007 Apple Inc. All rights reserved.