I am doing something more than a memory copy, but I always start by
understanding memory copy behavior on the target architecture. I
tried the STREAM benchmark on the G5, which gives 2GB/s copy
bandwidth, while on a 2.4GHz Intel Xeon I got 4.5GB/s with SSE2
using non-temporal writes. The G5 has a theoretical memory bandwidth
of 6.4GB/s, which is much higher than the 2GB/s I got and also
higher than the theoretical memory bandwidth of that Intel Xeon. I
later tried writing my own copy routine using AltiVec, but
performance improved only a little. I then tried memcpy and memmove;
the result is 3.1GB/s, which is better but still less than half of
6.4GB/s.
Keep in mind that the G5s use PC3200 memory, which has a
theoretical limit of 3.2GB/sec. So 3.1GB/sec is pretty good. The
6.4GB/sec is just the CPU interface. And I'm not sure, but it might
even be that this is 3.2GB/sec in and 3.2GB/sec out (the G5 has two
32-bit buses which can be used simultaneously, one for incoming
and one for outgoing data, but I don't know whether they're each
rated at 6.4GB/sec or whether that's the combined bandwidth).
The G5 memory controller is dual channel, though. There's 6.4GB/s
available at the memory pins, but with a 1GHz FSB the max read speed
at the CPU side is going to be 3.2GB/s or slightly less. If you are
able to copy 3.2GB/s or somewhere in that ballpark, you're getting
close to the "speed of light" for it.
Even a "perfect" memory controller would not be able to copy more
than 3.2GB/s using a 6.4GB/s memory subsystem, since the memory can't
read and write simultaneously: every copied byte has to cross the
memory bus twice, once as a read and once as a write.