If your operations are symmetric (e.g. a 2D separable convolution),
then one approach is to write them as a 1D convolve + transpose. Load
data in, convolve, transpose the results, store out, repeat. Simply
call it twice. While nice in theory, this means that you will be
storing data non-linearly, which can be particularly costly on G5 with
its small caches. Non-linear stores, even when all of them hit L1, can
be up to 8x slower than linear ones, so they should be avoided.
Non-linear loads tend to work a bit better, especially if you
prefetch, though of course front-side bus bandwidth may quickly become
an issue. Some form of tiling with linear stores is likely the optimal
solution.
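
As a rough illustration of tiling with linear stores, here is a minimal
scalar sketch of a tiled transpose. The function name, the TILE size of
16 and the row-major float layout are all assumptions made for the
example, not anything G5-specific or tuned. Within a tile, each
destination row is written left to right, and the strided accesses are
pushed onto the load side, where the tile keeps the source lines in
cache.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 16 };  /* hypothetical tile edge; tune for the target cache */

/* dst (cols x rows) = transpose of src (rows x cols), both row-major.
   Stores walk each destination row linearly; loads are strided but
   stay inside a TILE x TILE block of the source. */
static void transpose_tiled(const float *src, float *dst,
                            size_t rows, size_t cols)
{
    assert(src != dst);  /* out-of-place only */
    for (size_t c0 = 0; c0 < cols; c0 += TILE) {
        size_t c1 = (c0 + TILE < cols) ? c0 + TILE : cols;
        for (size_t r0 = 0; r0 < rows; r0 += TILE) {
            size_t r1 = (r0 + TILE < rows) ? r0 + TILE : rows;
            for (size_t c = c0; c < c1; c++) {
                float *out = dst + c * rows;         /* one destination row */
                for (size_t r = r0; r < r1; r++)
                    out[r] = src[r * cols + c];      /* linear store, strided load */
            }
        }
    }
}
```

A real version would use vector loads and in-register permutes for the
inner tile, but the loop order is the point here.
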
Where possible, a better approach is to do the linear pass first, then
rewrite the non-linear pass so that it is linear. How is this done? By
reordering the way you stride through data. Continuing our example with
the vertical 1D convolve, rather than walking down a single column of
results before doing the next column, one could do all the results from
one or more rows before moving to the next set of rows. This will mean
that the code for your vertical pass might look nothing like the
horizontal pass, but it will be considerably faster!
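
The reordering described above might look something like the following
scalar sketch. The function name, the "valid region only" edge handling
and the row-major float layout are assumptions for the example. Each
output row is built by accumulating whole input rows, so every load and
store in the inner loops is unit-stride.

```c
#include <assert.h>
#include <stddef.h>

/* Vertical 1D pass over the "valid" region only:
     dst[y][x] = sum over k of kernel[k] * src[y + k][x]
   (no kernel flip; for a symmetric kernel this equals convolution).
   src is rows x cols, dst is (rows - taps + 1) x cols, both row-major. */
static void convolve_vertical_rowwise(const float *src, float *dst,
                                      size_t rows, size_t cols,
                                      const float *kernel, size_t taps)
{
    assert(taps >= 1 && rows >= taps);
    size_t out_rows = rows - taps + 1;
    for (size_t y = 0; y < out_rows; y++) {
        float *out = dst + y * cols;
        for (size_t x = 0; x < cols; x++)
            out[x] = 0.0f;
        for (size_t k = 0; k < taps; k++) {
            const float *in = src + (y + k) * cols;  /* one whole input row */
            float w = kernel[k];
            for (size_t x = 0; x < cols; x++)
                out[x] += w * in[x];                 /* linear load, linear store */
        }
    }
}
```

The two inner x loops are the ones you would vectorize. Doing several
output rows at once lets you reuse each loaded input row.
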
The power-of-2 issue is even more important on G5. You can take
load/store dispatch group rejects (like the ones that stall float<->int
conversions) due to false aliasing when a store is N*65536 bytes away
from a load in the same dispatch group.
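
To see why power-of-2 row strides are the usual culprit, here is a
small sketch that computes how many rows apart two same-column elements
must be before they sit an exact multiple of 65536 bytes from each
other. The helper names are made up for the example, and it assumes the
simple "exact multiple of 65536" model from the paragraph above. The
real hardware comparison may be coarser than that.

```c
#include <assert.h>
#include <stddef.h>

enum { ALIAS_PERIOD = 65536 };  /* byte distance from the text above */

static size_t gcd_size(size_t a, size_t b)
{
    while (b != 0) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest row distance m > 0 such that m * stride_bytes is a multiple
   of ALIAS_PERIOD. Small results mean nearby rows can falsely alias. */
static size_t rows_until_alias(size_t stride_bytes)
{
    assert(stride_bytes > 0);
    return ALIAS_PERIOD / gcd_size(stride_bytes, ALIAS_PERIOD);
}
```

Under this model a 4096-byte stride aliases every 16 rows, while
padding the stride to 4224 bytes pushes that out to 512 rows.
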
The mtrans operation is pretty quick, but in the end it has to put up
with the same caching issues you do, so it is possibly not a cure-all.