(reposting after my original message was held up because it was 20 kB)
On Sep 17, 2008, at 6:47 PM, Eric Postpischil wrote:
Generally, you can write transposition code that performs reasonably
well without knowing the specific geometries of various parts of the
memory hierarchies. A matrix transposition blocked for 4-way
associative cache will still perform okay when run with 8-way
associative cache and so on. I would get something working first and
later worry about tuning it for specific systems.
Actually, I can just use the vImage functions (see my next posting).
I have tested them and they are blazing fast.
Problem solved.
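
For the archives, here is a rough sketch of the generic blocked transposition Eric describes above, for anyone who cannot use vImage. It is only an illustration under my own assumptions: the function name transpose_blocked, the float element type, the row-major layout, and the tile size of 32 are all made up for this example and are not tuned for any particular cache.

```c
#include <stddef.h>

/* Hypothetical tile edge length. 32 is an untuned guess; the point of
   blocking is that a reasonable value works acceptably on most caches. */
enum { BLOCK = 32 };

/* Transpose a rows x cols row-major matrix `src` into `dst`, which must
   hold cols x rows elements and must not overlap `src`. */
static void transpose_blocked(const float *src, float *dst,
                              size_t rows, size_t cols)
{
    for (size_t ib = 0; ib < rows; ib += BLOCK) {
        /* Clamp the tile so partial tiles at the edges are handled. */
        size_t imax = (ib + BLOCK < rows) ? ib + BLOCK : rows;
        for (size_t jb = 0; jb < cols; jb += BLOCK) {
            size_t jmax = (jb + BLOCK < cols) ? jb + BLOCK : cols;
            /* Work on one tile at a time so that both the rows being
               read and the rows being written stay cache-resident. */
            for (size_t i = ib; i < imax; i++) {
                for (size_t j = jb; j < jmax; j++) {
                    dst[j * rows + i] = src[i * cols + j];
                }
            }
        }
    }
}
```

Calling transpose_blocked(src, dst, rows, cols) fills dst so that dst[j * rows + i] equals src[i * cols + j]. The tile size is the only knob, which matches the advice to get something working first and tune later.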