My question is, when does store-forwarding occur, and can this
mitigate the dispatch-group rejection problem? Also, are there any
rules of thumb for how long to wait between writing to a memory
location and reading from it, or is it enough to simply place the
store and load in separate dispatch groups? To give an example: if I
store a 32-bit word to memory and the next instruction may
incidentally read from it (back into the integer register), how bad is
this, or is it only transfers between different register files that
trigger the problem?
A dispatch group is a series of up to 5 consecutive instructions from
the application's instruction stream. The fifth instruction, if there
is one, must be a branch. As long as the store and load are in
different dispatch groups you won't get a reject, though other lesser
stalls may still occur. GCC works around the rejects in some cases by
padding with up to three no-ops between the store and the dependent
load. In your own code, you can do this somewhat more efficiently by
unrolling the loop by four or more and doing all of the stores before
any of the dependent loads, so that useful instructions fill the gap.
In this way, the dependent loads are guaranteed not to fall in the same
dispatch group as the store.
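
A minimal sketch of that unrolled pattern in C, assuming the usual
magic-number int->double conversion through memory. The function name,
the scratch array and the bias macro are hypothetical, chosen only for
illustration. A C compiler is free to reorder or fuse these statements,
so treat this as a picture of the intended instruction order and check
the disassembly (or write the loop in assembly) before relying on it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 2^52 + 2^31: the bias removed after the bit pattern is read as a double. */
#define INT_TO_DOUBLE_BIAS 4503601774854144.0

/* Hypothetical helper: converts n int32 values to double, four at a time.
 * Assumes n is a multiple of 4. */
void int32_to_double_x4(const int32_t *in, double *out, size_t n)
{
    uint64_t scratch[4];   /* written from the integer side, read from the FP side */
    double d0, d1, d2, d3;

    for (size_t i = 0; i < n; i += 4) {
        /* Four stores, back to back. */
        scratch[0] = 0x4330000000000000ULL | ((uint32_t)in[i + 0] ^ 0x80000000u);
        scratch[1] = 0x4330000000000000ULL | ((uint32_t)in[i + 1] ^ 0x80000000u);
        scratch[2] = 0x4330000000000000ULL | ((uint32_t)in[i + 2] ^ 0x80000000u);
        scratch[3] = 0x4330000000000000ULL | ((uint32_t)in[i + 3] ^ 0x80000000u);

        /* Four dependent loads, each at least four instructions after its store. */
        memcpy(&d0, &scratch[0], sizeof d0);
        memcpy(&d1, &scratch[1], sizeof d1);
        memcpy(&d2, &scratch[2], sizeof d2);
        memcpy(&d3, &scratch[3], sizeof d3);

        out[i + 0] = d0 - INT_TO_DOUBLE_BIAS;
        out[i + 1] = d1 - INT_TO_DOUBLE_BIAS;
        out[i + 2] = d2 - INT_TO_DOUBLE_BIAS;
        out[i + 3] = d3 - INT_TO_DOUBLE_BIAS;
    }
}
```

The point of the layout is only the ordering: the other three stores do
the job that GCC's no-op padding would otherwise do.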
As far as store forwarding goes, as long as the dependent load is not
in the same dispatch group, the reject should not occur, so that is
what I'd shoot for. There is possibly some latency between when a
store finishes and when the data is available for store forwarding. I
haven't measured that latency on G5. The latency on G4 (7450) was about
6 cycles. Six cycles is a very small problem compared to a dispatch
reject (particularly on a highly out-of-order machine like the G5), so
most of the win comes from fixing the dispatch reject. Most of the time,
this problem happens in int<->float conversions and scalar<->vector data
moves. Apart from cases like those, which require this kind of data
movement, the compiler usually does not gratuitously spill registers
only to load them back immediately, except perhaps at -O0. With
-O0, store forwarding may indeed save the G5's bacon. Personally, I
haven't seen a case where store forwarding actually does prevent a
dispatch reject, but as I just said, the compiler may not emit them
frequently and I haven't been looking for them. (Only by mistake would
I be tracing code compiled with optimizations off, and I don't recall
seeing that pattern outside of the above described inter-register file
moves.) If you want to experiment, I suggest getting out SimG5 to look
at your code. That cycle-accurate CPU simulator should show you exactly
what is supposed to happen. It is installed as part of CHUD. Some
instructions for getting it to work are here:
When writing hand-tuned code that requires this sort of data motion, I
typically let the data rest after the store for a whole loop iteration:
data stored in iteration N is loaded back in iteration N+2. This
typically happens in code that has been software pipelined
(http://developer.apple.com/hardware/ve/software_pipelining.html), and
is accomplished by simply inserting a software stage in the algorithm
that does nothing between store and load. Such code is typically for
int<->float conversions. The best I've done is scalar code that
outperforms the compiler (with naive int<->float conversion by
typecast) by 33-fold. I think that was before gcc-3.3, though. More
recently, I've been beating it by 6-12x for this sort of simple
function. Most of that win, however, no doubt came from using fctiwz to
do saturation clipping (typically a requirement of such functions)
rather than relying on the compiler to do it, so the win from avoiding
the dispatch reject alone is likely smaller than that, perhaps 2x.
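
For illustration, here is a rough C sketch of the "rest for an
iteration" idea, again using an int->double conversion through memory.
The function name, the three-slot ring and the bias macro are my own
hypothetical choices, not taken from any particular library, and the
prologue and epilogue are folded into the loop with conditionals for
brevity where tuned code would peel them off. As with any C version,
the compiler may rearrange things, so verify the generated code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 2^52 + 2^31: the bias removed after the bit pattern is read as a double. */
#define INT_TO_DOUBLE_BIAS 4503601774854144.0

/* Hypothetical helper: software-pipelined int32 -> double conversion.
 * Pass i stores element i and loads back the element stored in pass i-2,
 * so every stored value rests for one full pass before it is read. */
void int32_to_double_pipelined(const int32_t *in, double *out, size_t n)
{
    uint64_t slot[3];   /* ring: one slot being written, one resting, one being read */

    for (size_t i = 0; i < n + 2; i++) {
        /* Final stage: load what pass i-2 stored and finish the conversion. */
        if (i >= 2) {
            double d;
            memcpy(&d, &slot[(i - 2) % 3], sizeof d);
            out[i - 2] = d - INT_TO_DOUBLE_BIAS;
        }

        /* Middle stage: deliberately empty. Element i-1 just sits in memory. */

        /* First stage: store the bit pattern for element i. */
        if (i < n)
            slot[i % 3] = 0x4330000000000000ULL
                        | ((uint32_t)in[i] ^ 0x80000000u);
    }
}
```

Three slots are enough because, within one pass, the slot being read
(i-2) and the slot being written (i) are never the same one modulo 3.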