I am doing something more than a memory copy, but I always start by
understanding the memory copy behavior of the target architecture. I ran
the STREAM benchmark on the G5, which gives 2 GB/s copy bandwidth, while
on a 2.4 GHz Intel Xeon I got 4.5 GB/s with SSE2 using non-temporal
writes. The G5 has a theoretical memory bandwidth of 6.4 GB/s, which is
much higher than the 2 GB/s I got and also higher than the theoretical
memory bandwidth of that Intel Xeon. I later wrote my own copy routine
using AltiVec, but the performance improved only a little. I also tried
memcpy and memmove; the result is 3.1 GB/s, which is better but still
less than half of 6.4 GB/s.
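
For anyone who wants to reproduce the memcpy number, here is a minimal
sketch of how such a measurement could be taken. It is not the code used
above; the function name copy_bandwidth_gbs, the buffer size, and the use
of POSIX clock_gettime are my own assumptions. Note the counting
convention: this sketch counts each byte once, whereas STREAM's Copy
kernel counts both the read and the write, so the two figures are not
directly comparable without knowing which convention was used.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: best-of-reps memcpy bandwidth in GB/s (1e9 bytes/s),
 * counting each copied byte once. Returns -1.0 on bad input or if
 * allocation fails. Use a buffer much larger than the caches. */
double copy_bandwidth_gbs(size_t bytes, int reps)
{
    if (bytes == 0 || reps <= 0)
        return -1.0;

    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not timed. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    double best = 1e30; /* shortest time seen, in seconds */
    for (int r = 0; r < reps; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, bytes);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        if (s > 0.0 && s < best)
            best = s;
    }

    /* Read the result so the compiler cannot discard the copies. */
    volatile char sink = dst[bytes / 2];
    (void)sink;

    free(src);
    free(dst);
    return (double)bytes / best / 1e9;
}
```

Calling copy_bandwidth_gbs(256u << 20, 10) would time ten 256 MiB copies
and report the fastest one. Taking the best run rather than the average
is a design choice that filters out scheduler noise.
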

Keep in mind that the G5s use PC3200 memory, which has a theoretical
limit of 3.2 GB/s. So 3.1 GB/s is pretty good. The 6.4 GB/s is
just the CPU interface. And I'm not sure, but it might even be that
this is 3.2 GB/s in and 3.2 GB/s out (the G5 has two 32-bit buses
which can be used simultaneously, one for incoming and one for
outgoing data -- but I don't know whether they're both rated at
6.4 GB/s or whether that's the combined bandwidth).
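
The arithmetic behind these figures can be sketched as follows. The
helper name peak_gbs is made up for illustration. The PC3200 line uses
the standard figures for that module type (400 million transfers per
second over a 64-bit data path); the per-direction bus line assumes an
effective rate of 800 million transfers per second on each 32-bit link,
which is only a guess that happens to match the "3.2 in, 3.2 out"
reading, not a confirmed specification.

```c
/* Peak bandwidth in GB/s (1e9 bytes/s):
 * transfers per second times bytes moved per transfer. */
double peak_gbs(double mega_transfers_per_s, int bus_bits)
{
    return mega_transfers_per_s * 1e6 * (bus_bits / 8.0) / 1e9;
}

/* PC3200 module: 400 MT/s on a 64-bit data path. */
double pc3200_peak_gbs(void)
{
    return peak_gbs(400.0, 64);
}

/* ASSUMPTION: one 32-bit CPU bus link at an effective 800 MT/s. */
double assumed_link_peak_gbs(void)
{
    return peak_gbs(800.0, 32);
}

/* Fraction of a theoretical peak that a measured figure reaches. */
double efficiency(double measured_gbs, double peak)
{
    return measured_gbs / peak;
}
```

Under those numbers, the measured 3.1 GB/s is about 97% of the 3.2 GB/s
memory limit, and two such links together would add up to the quoted
6.4 GB/s.
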