Mailing Lists: Apple Mailing Lists


Re: G3 optimization data?




When we optimize scalar code, there are several optimizations that crop up frequently:


1) load and store several data at a time.

The load/store unit is a bit weak compared to the G5's, so you can often get better throughput if you load
4 consecutive uint8_t's as a single uint32_t and then split it apart using shifts and masks. The same goes for
stores.
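
As a rough illustration of the wide-load idea, here is a minimal sketch that sums a byte buffer four bytes at a time. The function name sum_bytes is made up for this example. It assumes the compiler turns the 4-byte memcpy into a single word load, which you should confirm in the disassembly. A sum does not care which byte lands in which position; if you need the individual bytes in order, remember that the G3 is big-endian, so the first byte in memory is the one you get from (w >> 24).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum n bytes, fetching four at a time as one 32-bit word. */
uint32_t sum_bytes(const uint8_t *src, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, sizeof w);   /* intended: one word load, not four byte loads */
        sum += (w >> 24)                 /* split the word with shifts and masks */
             + ((w >> 16) & 0xFF)
             + ((w >> 8) & 0xFF)
             + (w & 0xFF);
    }
    for (; i < n; i++)                   /* leftover bytes, one at a time */
        sum += src[i];
    return sum;
}
```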


2) single precision is cheaper than double precision


3) Do float <-> int conversions intelligently.

The compiler often does not do them intelligently. If you need to do a lot of them, you can use the algorithm in
the IBM PowerPC Compiler Writer's Guide. Small tweaks on this can give you saturated clipping and round-to-nearest
behavior for free.
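
For reference, here is a sketch of the int -> double direction using the well-known magic-number trick, written from memory rather than copied from the guide, so check it against the guide before relying on it. The name int32_to_double is made up. It assumes IEEE-754 doubles and that integers and doubles share the same byte order. On a G3 you would build the two words with integer stores and reload them with lfd; the memcpy here just expresses the same bit pattern portably.

```c
#include <stdint.h>
#include <string.h>

/* Convert a signed 32-bit integer to double using only integer ops,
   one bit-pattern reinterpretation, and one floating-point subtract. */
double int32_to_double(int32_t x)
{
    /* High word 0x43300000 is the exponent for 2^52. The low word is x with
       its sign bit flipped, which equals x + 2^31 as an unsigned value. */
    uint64_t bits = 0x4330000000000000ULL | (uint32_t)((uint32_t)x ^ 0x80000000u);
    double d;

    memcpy(&d, &bits, sizeof d);       /* d == 2^52 + 2^31 + x, exactly */
    return d - 4503601774854144.0;     /* subtract 2^52 + 2^31 */
}
```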


4) Software pipelining

	http://developer.apple.com/hardware/ve/software_pipelining.html

	(Note: example shows intelligent int->float conversion)
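
If the link is unavailable, the basic shape of a software-pipelined loop looks something like the sketch below. This is a made-up example (scale_pipelined is not from the linked page): the load for iteration i+1 is issued before the multiply and store for iteration i, so the load latency overlaps with useful work.

```c
#include <stddef.h>

/* dst[i] = src[i] * k, with the load for the next element issued
   before the multiply/store for the current one. */
void scale_pipelined(float *restrict dst, const float *restrict src,
                     size_t n, float k)
{
    if (n == 0)
        return;

    float cur = src[0];                  /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {
        float next = src[i + 1];         /* load stage of iteration i+1 */
        dst[i] = cur * k;                /* compute/store stage of iteration i */
        cur = next;
    }
    dst[n - 1] = cur * k;                /* epilogue: drain the last element */
}
```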

4) SIMD within a register

Extra clever programmers will find ways to do two or more multiplies or adds with a single add or multiply operation
by stuffing two or more pieces of data in one 32-bit int. Parallelizing Boolean operations in this way is trivial.
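
One common form of this, sketched here with a made-up name (add16x2), adds two 16-bit lanes packed in a 32-bit int with a single add. The top bit of each lane is masked off before the add so that a carry cannot leak into the neighboring lane, and is then patched back in with an XOR.

```c
#include <stdint.h>

/* Add two pairs of 16-bit values packed in 32-bit words.
   Each lane wraps around independently (mod 2^16). */
uint32_t add16x2(uint32_t a, uint32_t b)
{
    const uint32_t high = 0x80008000u;             /* top bit of each lane */
    uint32_t low_sum = (a & ~high) + (b & ~high);  /* carries stop at bit 15 of each lane */

    return low_sum ^ ((a ^ b) & high);             /* restore the top bits without carry-out */
}
```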


5) rlwimi is your friend

You can often save a lot of work with this instruction, and the compiler doesn't emit it very often.
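
For anyone who has not met it, rlwimi rotates a source word left and inserts the bits selected by a mask into the destination, leaving the other destination bits alone. Below is a C sketch of that operation plus an RGB565 packing example. The names rotate_insert and pack565 are made up, and the mask is passed directly rather than as the instruction's begin/end bit numbers. Whether a given compiler turns this pattern into rlwimi is not guaranteed, so check the disassembly.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned sh)
{
    sh &= 31;
    return (x << sh) | (x >> ((32 - sh) & 31));   /* well-defined for sh == 0 too */
}

/* What rlwimi computes: rotate src left by sh, then replace the
   bits of dst selected by mask with the rotated bits. */
uint32_t rotate_insert(uint32_t dst, uint32_t src, unsigned sh, uint32_t mask)
{
    return (dst & ~mask) | (rotl32(src, sh) & mask);
}

/* Pack 8-bit r, g, b into RGB565 using one shift and two inserts. */
uint32_t pack565(uint32_t r, uint32_t g, uint32_t b)
{
    uint32_t px = b >> 3;                        /* blue:  bits 0..4   */
    px = rotate_insert(px, g, 3, 0x07E0u);       /* green: bits 5..10  */
    px = rotate_insert(px, r, 8, 0xF800u);       /* red:   bits 11..15 */
    return px;
}
```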

6) A limited amount of unrolling is usually helpful, usually 2-4 way

Unroll in parallel, not in series. The compiler will often do the latter due to unnecessary aliasing worries.
You may need to reorder the loads and stores yourself or use the restrict keyword.
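
To make "in parallel" concrete, here is a 4-way unrolled sketch (add_arrays is a made-up name). All the loads are grouped first, then the four independent adds, then the stores, and restrict tells the compiler that dst does not overlap a or b. A serial unroll would instead repeat load, add, store four times, which leaves the scheduler much less freedom.

```c
#include <stddef.h>

/* dst[i] = a[i] + b[i], unrolled 4-way with the iterations interleaved. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* all loads first */
        float a0 = a[i], a1 = a[i + 1], a2 = a[i + 2], a3 = a[i + 3];
        float b0 = b[i], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];

        /* four adds with no dependencies between them */
        float r0 = a0 + b0;
        float r1 = a1 + b1;
        float r2 = a2 + b2;
        float r3 = a3 + b3;

        /* all stores last */
        dst[i]     = r0;
        dst[i + 1] = r1;
        dst[i + 2] = r2;
        dst[i + 3] = r3;
    }
    for (; i < n; i++)                   /* leftover elements */
        dst[i] = a[i] + b[i];
}
```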


7) Use SimG4

The G3 is sufficiently like the early G4's that you can still get some good tips from SimG4 about how to optimize
your code. Schedule things so that you are dispatching two instructions every cycle. SimG4 is (was?) distributed with
CHUD / Shark. Apple recently started shipping SimG4+, which is for the 7450. There is also a SimG5. These come
courtesy of IBM/Freescale. Intel does not release simulators.


If you really want to learn how to do high performance programming, the simulator is the thing that will teach you
how. It is the last thing most people look at, but should be the first, in my opinion.


	http://developer.apple.com/hardware/ve/performance.html

8) Look at the disassembly

This is really the most important thing. Look at the inner loop in Shark or better yet look at it in SimG4 and
ask yourself if the compiler really did what you thought it was going to do. Usually it didn't, and you may need
to change a couple of things to get the code you wanted. In rare cases you'll need to resort to inline asm or even
writing the function in assembly to get the right output. A common practice is to think in assembly but write in C.


9) load/store with update

As Holger mentioned, these perform well on G3. When you only have two dispatch slots, each of these will save you
an add, which on average shaves half a cycle off your loop for each one you use.
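
In C, the usual way to invite the update forms (lwzu, stwu and friends) is to walk pointers with pre-increment instead of indexing. The sketch below uses a made-up name (copy_words). It handles the first element outside the loop so that no pointer is ever formed before the start of an array. Whether the compiler actually emits update-form instructions for it is an assumption to verify in the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n words using pre-incremented pointers. */
void copy_words(uint32_t *restrict dst, const uint32_t *restrict src, size_t n)
{
    if (n == 0)
        return;

    *dst = *src;                         /* first element, no pointer update */
    for (size_t i = 1; i < n; i++)
        *++dst = *++src;                 /* candidate for lwzu / stwu: address update folded into the access */
}
```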


Speedups from this sort of thing can range from 1.5x to 3x in the hands of an experienced programmer, so it is worth doing.

Ian
PerfOptimization-dev mailing list      (email@hidden)
References: 
 >G3 optimization data? (From: Rustam Muginov <email@hidden>)
 >Re: G3 optimization data? (From: Holger Bettag <email@hidden>)




Copyright © 2007 Apple Inc. All rights reserved.