I'm not certain exactly what this means (I'm still looking through
the references that Ian provided earlier). In this profiling mode, do
I want to correlate the percentage for this function with the amount
of time this function takes up in a Variable Time Profile? Or, by
virtue of it taking up 44% in the profile window, does it automatically
become a candidate for prefetching?
L2 cache misses take a few cycles to detect. Load latency for an L1 miss
/ L2 hit is about 12 cycles or so on G5 (exact number in 970 user
manual), which is about the time it takes to figure out whether the
data is in the L2 cache and then get it up to the LSU. The L2 cache
miss event counters behave similarly. Thus the program counter may have
advanced significantly in the time between when the load instruction
was issued and when the LSU realizes that an L2 cache miss has occurred.
The samples land where the program counter is pointing when the event is
detected, not on the instructions that cause them.
I think in this case, the delay is not so bad. The data is loaded using
lfsux (isn't that cracked/microcoded on G5?) and stuck in register f4.
The next instruction that actually uses f4 is the fsub instruction that
follows the load, and it is the one taking all the samples in Shark. It
can't proceed until the data appears, though on G5 it will at least make
it into the issue queues. It looks like we are running on a G4 here?
Compile with -g if Shark isn't showing which lines of source this loop
corresponds to.
I'd suggest two things here. One is to prefetch about 128 bytes ahead
of where that load is getting its data. The other one is to see about
removing some of that branchiness so that you can unroll this loop. If
the stride between loads is only 4 bytes, then maybe you are out of
luck and the hardware prefetch engine is already prefetching here. (You
still take cache misses because the code outruns the prefetch engine.)
In that case the only way to speed this up would be to find a way to do
that operation with the data already in the caches. You might tag this
code onto the end of whatever function produced that data, or
investigate some sort of tiling method.
If the L2 miss detection latency is causing problems figuring out which
load it was, you can often sample first on L2 cache misses to find
out where in general you are missing the caches. Then sample again on
L1 cache misses and go look at the L1 cache misses in the places where
you were seeing L2 cache misses. The reason this works is that all L2
cache misses are also L1 cache misses. The big difference is that L1
cache misses are reported very quickly, so it's a bit easier to figure
out which loads are stalling.
Another approach is to look for patterns. Sometimes you'll have a
pattern of loads with a very pretty set of sample shadows in the same
pattern a little later on. This happens mostly on G4 where the
instruction completion queue fills up during load stalls. G5 can hold
many times more instructions in flight so isn't as systematic about
delivering patterned shadows.