Using the Apple BLAS library improves speed by about 5%-10%,
compared with hand coding the determinant.
Any ideas on how to get it faster?
Can you work on multiple matrices at a time, and can you rearrange
them so the data from different matrices is interleaved? That would
allow you to use vector operations efficiently to do all the arithmetic.
E.g., suppose you have four matrices, so that A[i][j][k] is the (j, k)
element of matrix i. (This uses C notation and layout, so A[i][j][k]
is adjacent to A[i][j][k+1] in memory.) If you can instead have the
data arranged so the (j, k) element of matrix i is at A[j][k][i], then
elements from different matrices will be adjacent in memory. You will
also need A to be aligned to a multiple of 16 bytes.
Then you can simply write all the determinant arithmetic as if it were
for a single 4*4 matrix, but use vector code instead. This should be
significantly faster than anything that tries to deal with elements in
a single 4*4 matrix. It might even be worth the time it takes to
rearrange the data to and from the desired format.
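To make the layout concrete, here is a minimal sketch in portable C, not a tuned implementation. The names LANES, Interleaved4x4, interleave, and det4x4_interleaved are made up for illustration. The loop over i stands in for one four-wide vector operation per statement; real vector code (SSE or AltiVec, or a compiler's auto-vectorizer) would also need the 16-byte alignment mentioned above, which this sketch does not enforce.

```c
#define LANES 4  /* number of matrices processed together (assumed) */

/* Interleaved layout: m[j][k][i] is element (j, k) of matrix i,
   so the same element of all four matrices is contiguous in memory. */
typedef float Interleaved4x4[4][4][LANES];

/* Rearranges matrix-major data A[i][j][k] into interleaved m[j][k][i]. */
static void interleave(float A[LANES][4][4], Interleaved4x4 m)
{
    for (int i = 0; i < LANES; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                m[j][k][i] = A[i][j][k];
}

/* Computes det[i] for each of the LANES matrices.  The arithmetic is
   written exactly as for one 4*4 matrix (expansion by 2*2 minors of
   rows 0-1 against rows 2-3); every load touches LANES adjacent floats. */
static void det4x4_interleaved(Interleaved4x4 m, float det[LANES])
{
    for (int i = 0; i < LANES; ++i) {
        /* 2*2 minors taken from rows 0 and 1. */
        float s0 = m[0][0][i] * m[1][1][i] - m[1][0][i] * m[0][1][i];
        float s1 = m[0][0][i] * m[1][2][i] - m[1][0][i] * m[0][2][i];
        float s2 = m[0][0][i] * m[1][3][i] - m[1][0][i] * m[0][3][i];
        float s3 = m[0][1][i] * m[1][2][i] - m[1][1][i] * m[0][2][i];
        float s4 = m[0][1][i] * m[1][3][i] - m[1][1][i] * m[0][3][i];
        float s5 = m[0][2][i] * m[1][3][i] - m[1][2][i] * m[0][3][i];

        /* 2*2 minors taken from rows 2 and 3. */
        float c0 = m[2][0][i] * m[3][1][i] - m[3][0][i] * m[2][1][i];
        float c1 = m[2][0][i] * m[3][2][i] - m[3][0][i] * m[2][2][i];
        float c2 = m[2][0][i] * m[3][3][i] - m[3][0][i] * m[2][3][i];
        float c3 = m[2][1][i] * m[3][2][i] - m[3][1][i] * m[2][2][i];
        float c4 = m[2][1][i] * m[3][3][i] - m[3][1][i] * m[2][3][i];
        float c5 = m[2][2][i] * m[3][3][i] - m[3][2][i] * m[2][3][i];

        det[i] = s0 * c5 - s1 * c4 + s2 * c3
               + s3 * c2 - s4 * c1 + s5 * c0;
    }
}
```

If the matrices can be produced in the interleaved layout in the first place, the interleave step (and its cost) disappears entirely.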
On Nov 1, 2007, at 9:36, Daniel J Farrell wrote:
In your experience would it be worth 90+ calls to vDSP_mul?
I would not expect vDSP_vmul (there is no vDSP_mul) to work well with
elements inside a 4*4 matrix or with strided elements. It is best with
long consecutive sequences.
—edp
Scitech mailing list (email@hidden)