As far as I can tell, the non-unit strides in the calls made by SBDSQR
to srot_ *cannot* be avoided — they are essential to the "chase"
algorithm used by the SVD to reduce the bidiagonal intermediary to
final diagonal form (with the singular values lying on the diagonal).
The rotations computed there are also applied to large dense
rectangular and square matrices to develop the left and right singular
vectors. Even if the "iteration count" passed to srot_ is large, the
memory accesses are widely strided (by either "nRows" or "nCols"
elements) from one iteration to the next. That's a sure prescription
for a memory bounded algorithm.
Well, thanks for looking anyway. It's nice to know that it's not
because I've done something immediately stupid, like poor memory layout
or something.
Have you tried to solve your least-squares problems using something
other than the SVD (i.e. one of the orthogonal decompositions)? The
SVD is a big and costly hammer.
This is where my knowledge of numerical methods, and specifically
LAPACK falls down.
The large matrix describing the equations that I want to solve (A) is
extremely sparse. In general I would expect only 2 or 3 entries to be
non-zero on any row. Unfortunately the form of the matrix is rather
random as it is formed by taking random samples from data, and
therefore doesn't have a predicable shape. My knowledge of numerical
methods suggested that with such a random population of elements and
the over-determined nature of the system that a general SVD routine
would be the only option.