Mailing Lists: Apple Mailing Lists


Re: Hotspot in Accelerate Framework




On Jan 3, 2005, at 2:13 AM, Paul Sargent wrote:



I've submitted a bug at Ian O's suggestion (3936624).

Which I've diagnosed as:

ATL_srot_xp0yp0aXbX is the ATLAS BLAS fallback case for non-unit stride along X or Y. This case is *not* aggressively optimized because it is almost certain to be *memory* bound. The Shark trace shows nearly 25% of the total run time consumed by loading operands from memory in this routine. The stalls that Shark flags are actually inconsequential (on the 970 anyway): they occur while waiting for memory operands to arrive. [Note that Shark does *not* model memory access; it assumes operands are all in the nearby L1.]
[Note also that the *unit* stride case is unrolled to good advantage in Apple's libBLAS.dylib.]
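
For anyone following along, here is a minimal sketch of what a strided plane rotation does. This is not the ATLAS source; rot_strided is a hypothetical reference routine written only to show the access pattern, and it handles positive strides only. Each iteration does four multiplies and two adds against two loads and two stores, so when incx or incy is large the loop spends its time waiting on memory rather than on arithmetic.

```c
#include <stddef.h>

/* Hypothetical reference routine (not the ATLAS implementation).
 * Applies the plane rotation
 *     x[i] <-  c*x[i] + s*y[i]
 *     y[i] <-  c*y[i] - s*x[i]
 * to n element pairs, stepping through memory by incx and incy
 * elements. Only positive strides are handled in this sketch. */
static void rot_strided(int n, float *x, int incx,
                        float *y, int incy, float c, float s)
{
    for (int i = 0; i < n; i++) {
        float *px = x + (size_t)i * (size_t)incx;
        float *py = y + (size_t)i * (size_t)incy;
        float xi = *px;
        float yi = *py;
        *px = c * xi + s * yi;
        *py = c * yi - s * xi;
    }
}
```

Rotating two adjacent rows of a column-major matrix with leading dimension lda would be a call of the form rot_strided(nCols, &a[r], lda, &a[r + 1], lda, c, s), which is the non-unit stride case discussed above.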


As far as I can tell, the non-unit strides in the calls made by SBDSQR to srot_ *cannot* be avoided: they are essential to the "chase" algorithm used by the SVD to reduce the bidiagonal intermediary to final diagonal form (with the singular values lying on the diagonal). The rotations computed there are also applied to large dense rectangular and square matrices to develop the left and right singular vectors. Even if the "iteration count" passed to srot_ is large, the memory accesses are widely strided (by either "nRows" or "nCols" elements) from one iteration to the next. That's a sure prescription for a memory-bound algorithm.
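
To put a rough number on the stride penalty, the sketch below counts how many distinct cache lines a run of float loads touches at a given element stride. The helper name lines_touched is hypothetical, the base address is assumed to be line-aligned, and the 128-byte line size used in the example is an assumption about the target machine, not something taken from the Shark trace.

```c
#include <stddef.h>

/* Hypothetical helper: number of distinct cache lines touched by n
 * float loads spaced stride_elems elements apart, starting from a
 * line-aligned base address. Offsets only increase, so counting the
 * changes of line index counts the distinct lines. */
static size_t lines_touched(size_t n, size_t stride_elems,
                            size_t line_bytes)
{
    size_t count = 0;
    size_t last_line = (size_t)-1;   /* sentinel: no line seen yet */
    for (size_t i = 0; i < n; i++) {
        size_t line = (i * stride_elems * sizeof(float)) / line_bytes;
        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

With the sizes quoted in this thread (about 450 elements per rotation, stride of about 1800 floats) and an assumed 128-byte line, lines_touched(450, 1800, 128) gives one line per element, while the unit-stride call lines_touched(450, 1, 128) packs 32 floats into each line. Every strided operand is effectively a fresh trip to memory.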

Have you tried to solve your least-squares problems using something other than the SVD (i.e. one of the orthogonal decompositions)? The SVD is a big and costly hammer.

SCP



The call which causes this is as follows

    sgelss_(&nRows, &nColumns, &nRHS,
            (float *)[equationMatrixRedA bytes], &nRows,
            (float *)[equationMatrixRedb bytes], &nRows,
            (float *)[sArray bytes], &rCond,
            &rank,
            (float *)[workMatrix bytes], &workSize,
            &info);

where
  nRows    ~=  1800
  nColumns ~=  450
  nRHS      =  1
  rCond     = -1

equationMatrixRedA is an NSData nRows * nColumns * sizeof(float) in size.
equationMatrixRedb is an NSData nRows * nRHS * sizeof(float) in size.
sArray is an NSData MIN(nRows,nColumns) * sizeof(float) in size.
workMatrix is an NSData workSize in size.


workSize is obtained by doing a query call to sgelss_ (i.e. workSize set to -1, all other parameters the same) just before the main call.

The call stack is given below. There appears to be only one call path.

libBLAS.dylib      ATL_srot_xp0yp0aXbX
libLAPACK.dylib    f2c_srot
libLAPACK.dylib    SBDSQR
libLAPACK.dylib    sgelss_

 Paul



Steve Peters
email@hidden

PerfOptimization-dev mailing list      (email@hidden)


