
Re: L2 Cache Miss (was: Floating Point comparison G5)




On Jan 9, 2005, at 4:38 PM, David Gohara wrote:


In profiling with Shark using Ian's suggestion for L2 cache misses, I got the following:

44% was on one function from my program
40% was from the mach kernel library (pmap_zero_page)

  The assembly for my code from this profile where it hotspots is:

0xe9f0 cmpwi cr5,r5,-3 1:1
0xe9f4 cmpwi cr2,r5,-1 1:1
0xe9f8 ori r28,r9,0x0000 1:1
0xe9fc ori r27,r8,0x0000 1:1
0xea00 ori r26,r7,0x0000 1:1
0xea04 mtctr r30 *2:2
0xea08 add r30,r25,r17 1:1
0xea0c add r25,r25,r13 1:1
0xea10 lfsux f4,r10,r2 4:1 ! Stall=1, Loop start[20], Unroll, AltiVec
0xea14 cmpwi r5,0 1:1
0xea18 addi r13,r6,1 1:1
52.9% 52.3% 0xea1c fsub f4,f31,f4 5:1 Stall=4
0.0% 0.0% 0xea20 fmadd f4,f4,f4,f5 5:1
0xea24 beq $+900 <cards13and14_ + 9480> 1:1
0xea28 cmpw cr3,r13,r31 1:1
0xea2c beq cr3,$+724 <cards13and14_ + 9312> 1:1
0xea30 cmpw r14,r31 1:1
0xea34 beq $+716 <cards13and14_ + 9312> 1:1
0xea38 beq cr1,$+660 <cards13and14_ + 9260> 1:1
0xea3c cmpw r24,r31 1:1
0xea40 bne $+604 <cards13and14_ + 9212> 1:1
0xea44 lfs f5,0(r29) 4:1 Stall=3
0xea48 fsubs f5,f2,f5 5:1 Stall=4
0xea4c fmadds f7,f5,f5,f7 5:1
0xea50 b $+588 <cards13and14_ + 9212> 1:1


I'm not certain what exactly this means (I'm still looking through the references that Ian provided earlier). In this profiling mode, do I want to correlate the percentage for this function with the amount of time this function takes up in a Variable Time Profile? Or, by virtue of it taking up 44% in the profile window, does it automatically become a candidate for prefetching?


L2 cache misses take a few cycles to detect. Load latency for an L1 miss / L2 hit is about 12 cycles or so on G5 (the exact number is in the 970 user manual), which is about the time it takes to figure out whether the data is in the L2 cache and then get it up to the LSU. The L2 cache miss event counters behave similarly. Thus the program counter may have advanced significantly in the time between when the load instruction was issued and when the LSU realizes that an L2 cache miss has occurred. The samples land where the program counter is pointing when the event is detected, not on the instructions that cause them.

I think in this case the delay is not so bad. The data is loaded using lfsux (isn't that cracked/microcoded on G5?) and stuck in register f4. The next instruction that actually uses f4 is the fsub instruction that follows the load, and it is the one taking all the samples from Shark. It can't proceed until the data appears, though on G5 it will at least make it into the issue queues. It looks like we are running on a G4 here?

Compile with -g if Shark isn't showing which lines of source this loop corresponds to.

I'd suggest two things here. One is to prefetch about 128 bytes ahead of where that load is getting its data. The other is to see about removing some of that branchiness so that you can unroll this loop. If the stride between loads is only 4 bytes, then maybe you are out of luck and the hardware prefetch engine is already prefetching here. (You still take cache misses because the code outruns the prefetch engine.) In that case the only way to speed this up would be to find a way to do that operation with the data already in the caches. You might tag this code onto the end of whatever function produced that data, or investigate some sort of tiling method.
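As a rough illustration of the prefetch suggestion, here is a minimal C sketch. It is not the original code: the function name sum_sq_diff, its parameters, and the PREFETCH_BYTES constant are hypothetical, and the loop body only mimics the load / fsub / fmadd pattern in the listing (a strided float load, a subtract from a constant, and a squared accumulate). It uses the GCC/Clang builtin __builtin_prefetch rather than a hand-written cache-touch instruction, and it assumes stride is at least 1.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed prefetch distance, following the "about 128 bytes ahead" advice. */
enum { PREFETCH_BYTES = 128 };

/* Hypothetical stand-in for the hot loop:
 * accumulates (c - x[i * stride])^2 over n strided elements. */
static float sum_sq_diff(const float *x, size_t n, size_t stride, float c)
{
    assert(stride >= 1);

    /* Convert the byte distance into a whole number of iterations. */
    const size_t stride_bytes = stride * sizeof(float);
    const size_t ahead = (PREFETCH_BYTES + stride_bytes - 1) / stride_bytes;

    /* Iterations for which the prefetch target is still inside the data. */
    const size_t main_n = (n > ahead) ? n - ahead : 0;

    float acc = 0.0f;
    size_t i = 0;

    /* Main loop: touch the element "ahead" iterations away, then do the work. */
    for (; i < main_n; i++) {
        __builtin_prefetch(&x[(i + ahead) * stride], 0 /* read */, 0);
        float d = c - x[i * stride];
        acc += d * d;
    }

    /* Tail loop: same work, no prefetch, so we never form an address
     * beyond the last element. */
    for (; i < n; i++) {
        float d = c - x[i * stride];
        acc += d * d;
    }
    return acc;
}
```

Splitting the loop into a main part and a tail keeps the bounds check for the prefetch out of the hot loop, which fits the advice about reducing branches. Whether the prefetch helps at all depends on the stride: with a small stride the hardware prefetcher may already be covering these loads, as noted above.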

If the L2 miss detection latency is causing problems figuring out which load it was, you can frequently sample first on L2 cache misses to find out where in general you are missing the caches. Then sample again on L1 cache misses and go look at the L1 cache misses in the places where you were seeing L2 cache misses. The reason this works is that all L2 cache misses are also L1 cache misses. The big difference is that L1 cache misses are reported very quickly, so it's a bit easier to figure out which loads are stalling.
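The two-pass idea can be sketched as a small post-processing step in C. Everything here is hypothetical: the struct sample layout, the likely_load function, and the window parameter are made up for illustration and do not correspond to any Shark export format. The sketch only assumes you have copied per-instruction sample percentages from the two profiles by hand.

```c
#include <stddef.h>

/* One row of a hypothetical per-instruction sample table. */
struct sample {
    unsigned long addr;  /* instruction address */
    double percent;      /* share of samples landing on that address */
};

/* Given the L1-miss samples and the address of one L2-miss hotspot,
 * return the index of the strongest L1-miss sample located at or
 * shortly before the hotspot (within "window" bytes), or -1 if there
 * is none. The L2 sample is expected to land later than the load
 * that caused it, so only earlier addresses are considered. */
static int likely_load(const struct sample *l1, size_t n_l1,
                       unsigned long l2_hot_addr, unsigned long window)
{
    int best = -1;
    for (size_t i = 0; i < n_l1; i++) {
        if (l1[i].addr <= l2_hot_addr &&
            l2_hot_addr - l1[i].addr <= window) {
            if (best < 0 || l1[i].percent > l1[best].percent)
                best = (int)i;
        }
    }
    return best;
}
```

For the listing above, the L2 hotspot would be the fsub at 0xea1c, and the expectation is that the L1-miss profile points back at the lfsux at 0xea10. This simple address comparison ignores loop back-edges, where the responsible load can sit at a higher address than the sample.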

Another approach is to look for patterns. Sometimes you'll have a pattern of loads with a very pretty set of sample shadows in the same pattern a little later on. This happens mostly on G4, where the instruction completion queue fills up during load stalls. G5 can hold many times more instructions in flight, so it isn't as systematic about delivering patterned shadows.

Ian

