On Thu, 21 Oct 2004, Ian Ollmann wrote:
> Intel gets quite far with a combination of high clock frequency, low
> latencies, and I'm guessing some aggressive instruction rescheduling
> and store forwarding. These things work well for simple, non-unrolled
> code.
>
Intel's optimization guide for the Pentium 4 discourages loop unrolling.
AMD's guide for the Athlon XP and Athlon 64, however, recommends it. Older
optimization lore on the Web and on Usenet encourages unrolling for the
Pentium 3, which is still quite relevant today for the Pentium M, since
that is a direct descendant of the P3 core.

My personal experience with the Pentium 2 and Pentium 3 is that modest
unrolling (by two, or sometimes by three) often helps and does not hurt
the Pentium 4. I have noticed that the Pentium 4 generally gains notably
less from manual tuning than other processors do. It's just not easy to
keep these monster pipelines fed from tiny register files.
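
A minimal sketch in C of what "modest unrolling by two" can look like,
assuming a simple array summation as the workload; the function names
are hypothetical and not from this thread, and whether the unrolled
version is faster depends on the CPU and compiler, as noted above. The
unrolled loop uses two independent accumulators so consecutive additions
do not have to wait on each other.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: one addition per iteration; every add depends on the last. */
long sum_simple(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical example: the same loop unrolled by two. Two independent
   accumulators (s0, s1) form two addition chains the CPU can overlap,
   and the loop overhead (compare, branch, index update) is paid once
   per two elements. */
long sum_unroll2(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        s0 += a[i];
    return s0 + s1;
}
```

Called on {1, 2, 3, 4, 5, 6, 7}, both functions return 28. With integer
data the regrouped sum is exact; with float or double data, splitting the
sum across two accumulators changes the order of additions, so results
can differ in the last bits. That is one reason compilers generally do
not apply this transformation to floating-point loops unless told that
reassociation is acceptable (for example with GCC's -ffast-math).
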

The most effective low-level optimization I have found for 'x86 CPUs is
to tweak the data flow so that the destructive operators (+=, <<=, &=,
etc.) are favoured over the usual infix operators. They are a natural fit
for the native two-operand 'x86 instruction encoding, and they reduce
register pressure as well.
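
A hedged illustration in C of the two styles, assuming a made-up mixing
function; the names mix_infix, mix_destructive and check_mix and the
constants are hypothetical. The destructive version updates one value in
place, which mirrors two-operand 'x86 forms such as "add eax, ebx" where
the destination is also a source. An optimizing compiler may well emit
identical code for both, so the benefit lies mainly in how the data flow
is arranged, especially in larger loops where many values are live.

```c
#include <assert.h>
#include <stdint.h>

/* Infix style: each step produces a new named value, as in three-operand
   "c = a op b". On two-operand 'x86, an operation whose inputs are both
   still needed afterwards costs an extra register copy. */
uint32_t mix_infix(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    uint32_t t = s << 3;
    uint32_t u = t & 0xFFFFu;
    uint32_t v = u ^ a;
    return v;
}

/* Destructive style: one value is updated in place, so each step lines
   up with a two-operand instruction such as "add eax, ebx" or
   "shl eax, 3". */
uint32_t mix_destructive(uint32_t a, uint32_t b)
{
    uint32_t x = a;
    x += b;
    x <<= 3;
    x &= 0xFFFFu;
    x ^= a;
    return x;
}

/* Sanity check: both styles compute the same result. */
void check_mix(uint32_t a, uint32_t b)
{
    assert(mix_infix(a, b) == mix_destructive(a, b));
}
```

For a = 1 and b = 2, both functions return 25.
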
Holger
_______________________________________________
PerfOptimization-dev mailing list (email@hidden)