When we optimize scalar code, there are several optimizations that
crop up frequently:
1) Load and store several data elements at a time.
The load/store unit is a bit weak compared to the G5's, so you can
often get better throughput if you load 4 consecutive uint8_t's as a
uint32_t and then split it apart using shifts and masks. The same
goes for stores.
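
A minimal sketch of the wide-load idea in portable C. The function
name sum_bytes_wide is made up for illustration, and the length is
assumed to be a multiple of 4 to keep it short. The memcpy is there
to keep the access legal C; a compiler will usually turn it into a
single word load, but check the disassembly rather than trusting that.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum n bytes, fetching four at a time with one 32-bit load.
   Assumption for this sketch: n is a multiple of 4. */
static uint32_t sum_bytes_wide(const uint8_t *src, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, sizeof w);  /* one word instead of four byte loads */
        /* Split the word apart with shifts and masks. The byte order
           within w depends on endianness, which does not matter for a sum. */
        sum += (w >> 24);
        sum += (w >> 16) & 0xFFu;
        sum += (w >> 8) & 0xFFu;
        sum += w & 0xFFu;
    }
    return sum;
}
```

The same trick works in reverse for stores: build the word with shifts
and ORs, then write it out once.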
2) Single precision is cheaper than double precision.
3) Do float <-> int conversions intelligently.
The compiler often does not do them intelligently. If you need to do
a lot of them, you can use the algorithm in the IBM PowerPC Compiler
Writer's Guide. Small tweaks on this can give you saturated clipping
and round-to-nearest behavior for free.
(Note: example shows intelligent int->float conversion)
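
For reference, here is a sketch of the well-known magic-number
int->double conversion, which is in the spirit of what the guide
describes. The function name int_to_double_magic is made up, and the
code assumes IEEE-754 doubles with the same byte order as 64-bit
integers. On a real PowerPC you would store the two words to memory
and reload them with a floating point load; the memcpy stands in for
that here.

```c
#include <stdint.h>
#include <string.h>

/* Convert a signed 32-bit integer to double without an integer->FP
   conversion instruction. Assumes IEEE-754 binary64. */
static double int_to_double_magic(int32_t x)
{
    /* Flip the sign bit: the result is the unsigned value x + 2^31. */
    uint32_t biased = (uint32_t)x ^ 0x80000000u;

    /* High word 0x43300000 is the exponent of 2^52. With the biased
       integer in the low mantissa bits, the double is 2^52 + 2^31 + x. */
    uint64_t bits = 0x4330000000000000ULL | biased;

    double d;
    memcpy(&d, &bits, sizeof d);

    /* Subtract 2^52 + 2^31; the result is exactly x. */
    return d - 4503601774854144.0;
}
```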
4) SIMD within a register
Extra clever programmers will find ways to do two or more multiplies
or adds with a single add or multiply operation
by stuffing two or more pieces of data in one 32-bit int.
Parallelizing Boolean operations in this way is trivial.
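
One common way to do the packed add, as a sketch: four 8-bit lanes in
a 32-bit int, with the carries kept from spilling into the next lane.
The name add_4x8 is made up for illustration.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes with a single 32-bit add.
   Each lane wraps modulo 256 independently of its neighbors. */
static uint32_t add_4x8(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; the cleared top bits absorb
       the carries, so nothing crosses a lane boundary. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);

    /* Add the top bit of each lane without carry-out, using XOR. */
    uint32_t top = (a ^ b) & 0x80808080u;

    return low7 ^ top;
}
```

Boolean operations need none of this care: a single AND, OR or XOR
already handles all the lanes at once.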
5) rlwimi is your friend
You can often save a lot of work with this instruction, and the
compiler doesn't emit it very often.
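
For anyone who hasn't met it: rlwimi rotates the source left, then
inserts the bits selected by a mask into the destination and leaves
the other destination bits alone. Below is a C model of that behavior.
The helper names are made up, and there is no promise that a given
compiler will turn this pattern into an actual rlwimi. The real
instruction also takes its mask as a contiguous bit range, which the
sketch does not enforce.

```c
#include <stdint.h>

/* Rotate left by n bits. Assumption for this sketch: n is 1..31. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32u - n));
}

/* Model of rlwimi: rotate src, insert the masked bits into dst,
   keep the remaining bits of dst unchanged. */
static uint32_t rotate_insert(uint32_t dst, uint32_t src,
                              unsigned n, uint32_t mask)
{
    return (dst & ~mask) | (rotl32(src, n) & mask);
}

/* Example use: replace the 5-bit field in bits 5..9 of a packed
   555-style pixel in one step. */
static uint32_t set_green555(uint32_t pixel, uint32_t green)
{
    return rotate_insert(pixel, green, 5, 0x03E0u);
}
```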
6) A limited amount of unrolling is usually helpful, typically 2- to 4-way.
Unroll in parallel, not in series. The compiler will often do the
latter due to unnecessary aliasing worries. You may need to reorder
the loads and stores by hand or use the restrict keyword.
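
A sketch of what unrolling in parallel looks like in C99. The function
name scale4 is made up, and the length is assumed to be a multiple of
4. All four loads are issued first, then the independent multiplies,
then the four stores, and restrict tells the compiler that dst and src
do not overlap.

```c
#include <stddef.h>

/* Multiply n floats by k, unrolled 4-way "in parallel".
   Assumption for this sketch: n is a multiple of 4. */
static void scale4(float *restrict dst, const float *restrict src,
                   float k, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        /* All loads first, so none of them waits behind a store. */
        float x0 = src[i];
        float x1 = src[i + 1];
        float x2 = src[i + 2];
        float x3 = src[i + 3];

        /* Four independent multiplies that can overlap in the pipeline. */
        x0 *= k;
        x1 *= k;
        x2 *= k;
        x3 *= k;

        /* All stores last. */
        dst[i] = x0;
        dst[i + 1] = x1;
        dst[i + 2] = x2;
        dst[i + 3] = x3;
    }
}
```

Unrolling in series would be load, multiply, store, repeated four
times, which is what you tend to get when the compiler thinks each
store might change the next load.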
7) Use SimG4
The G3 is sufficiently like the early G4's that you can still get
some good tips from SimG4 about how to optimize
your code. Schedule things so that you are dispatching two
instructions every cycle. SimG4 is (was?) distributed with
CHUD / Shark. Apple recently started shipping SimG4+, which is for
the 7450. There is also a SimG5. These come
courtesy of IBM/Freescale. Intel does not release simulators.
If you really want to learn how to do high performance programming,
the simulator is the thing that will teach you
how. It is the last thing most people look at, but should be the
first, in my opinion.
8) Look at what the compiler actually generated
This is really the most important thing. Look at the inner loop in
Shark or better yet look at it in SimG4 and
ask yourself if the compiler really did what you thought it was
going to do. Usually it doesn't, and you may need
to change a couple of things to get the code you wanted. In rare
cases you'll need to resort to inline asm or even writing the whole
function in assembly to get the right output. A common practice is
to think in assembly but write in C.
9) load/store with update
As Holger mentioned, these perform well on G3. When you only have
two dispatch slots, each of these will save you
an add, which shaves a half cycle off your loop for each one you use
on average.
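
In C, the usual way to ask for the update forms is to pre-increment
the pointer in the same expression as the access. A sketch follows,
with the made-up name sum_words. Whether you actually get lwzu out of
it depends on the compiler, so look at the generated code.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n words using the "*++p" shape, which matches what a load with
   update does: bump the pointer, then load from the new address. */
static uint32_t sum_words(const uint32_t *p, size_t n)
{
    if (n == 0)
        return 0;

    uint32_t sum = *p;      /* first element with a plain load */
    while (--n != 0)
        sum += *++p;        /* pointer bump and load in one step */
    return sum;
}
```

The first element is handled separately so the pointer never has to
start one element before the array.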
Speedups from this sort of thing can range from 1.5x to 3x in the
hands of an experienced programmer, so it is worth doing.