On Fri, 15 Oct 2004, Gohara, David wrote:
> Similarly is it more efficient time wise to do the shuffling all at once
> (essentially rotating the array 90 degrees) or doing it by converting
> one column to a row, one at a time? Or is there no difference?
>
If the array is square, there is a particularly efficient way to transpose
it in place.
1. Divide it into four quarters
A B
C D
2. exchange B and C in place (this is the slow part, at least when the
array is large, but you can access memory with stride 1 and get the
best of the bus bandwidth)
A C
B D
3. recurse into each quarter (eventually the a tile will fit in the
cache, regardless of cache size, and the code will run quite fast)
Do not recurse all the way to tile size 2x2, this would have too much
function call overhead. Better to write a special purpose routine for
some minimum tile size. Maybe 4x4, as that would fit completely in
registers, or some 8x8 variant with inlining instead of recursion.
I guess the OS X library routines will actually apply (a generalization
of) this technique.
Holger
_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/perfoptimization-dev/email@hidden
This email sent to email@hidden