My personal experience with Pentium 2 and Pentium 3 is that modest
unrolling (by two or sometimes by three) often helps, and does not hurt
Pentium 4. I noticed that Pentium 4 generally gains notably less from
manual tuning than other processors do. It's just not easy to keep these
monster pipelines fed from tiny register files.
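To make "unrolling by two" concrete, here is a minimal sketch in C. The
summation loop and the names sum_simple and sum_unrolled2 are made up for
illustration; they are not from any code I measured, and whether the
unrolled form wins on a given processor still has to be timed.

```c
#include <stddef.h>

/* Straightforward loop: one element per iteration. */
long sum_simple(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Hand-unrolled by two. Two independent accumulators are used so the
 * two adds in the body do not depend on each other. This costs one
 * extra named register, which is the trade-off on small register files. */
long sum_unrolled2(const int *a, size_t n)
{
    long even = 0, odd = 0;
    size_t i = 0;

    /* Main body: two elements per trip, half as many branches. */
    for (; i + 1 < n; i += 2) {
        even += a[i];
        odd  += a[i + 1];
    }

    /* Cleanup for an odd element count. */
    if (i < n)
        even += a[i];

    return even + odd;
}
```

Both functions return the same result for any n, including 0 and odd
counts; only the loop overhead and the register usage differ.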
The trace cache seems likely to pay off much better if it has to
translate less code, and its reward seems likely to be greater the more
times a loop goes around.
I also think that when largish issue queues like the G5's are used to do
instruction reordering, the queues in effect do the unrolling for you,
as long as you don't run out of issue queue slots and rename registers.
The queues may do it better because there are many more renames than
named registers. Unfortunately, if you unroll by hand but end up doing
extra work to cope with limited register availability, the queues'
ability to unroll for you is lessened. The demands on queue length and rename
availability become that much larger with each new instruction in the
loop, in order to unroll the loop in the queues.