Well, I spent quite a while on ADC and came up empty-handed.
I actually wrote my own routine that did the job, and generated really
similar code. But I would rather call out to an Apple routine instead
of making my own, since they have the opportunity to super-fine-tune
it or redesign it when a new CPU type comes out.
Honestly, I was also hoping that vSaxpy would work on G3s as well,
sparing me the effort of writing separate code for G3s and G4s. I was
under the impression that Accelerate would work on G3s, just more
slowly. But looking at the implementation in gdb, it sure looks like
it requires a G4 just to get off the ground. (Unless Mach loads in a
different library entirely for G3s...??)
Mach-O supports split files with multiple forks for different
architectures or CPUs. Almost nobody uses this feature at the moment,
but this is subject to change. You can use the command-line "file"
command to see whether a file is split into multiple forks, and which
forks are present. If you look today, you will find that vecLib is not
split into multiple forks. That is also subject to change.
saxpy is part of BLAS, which is an industry standard linear algebra
package that has been around for quite some time. The version in
cblas.h should work fine on G3. I think there are a number of things
that would work for you:
vDSP.h: vsmul() and vadd() (works on G3 and G4)
cblas.h: cblas_saxpy() (works on G3 and G4)
vBLAS.h: SAXPY()
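
For reference, the operation all of these compute is y = alpha * x + y.
Below is a minimal scalar sketch in plain C. ref_saxpy is a hypothetical
name used only for illustration (it is not an Apple routine); its
arguments are laid out in the same order as cblas_saxpy(), and it
assumes positive strides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine, not part of vecLib:
   y[i*incy] = alpha * x[i*incx] + y[i*incy] for i in [0, n).
   Assumption: strides are positive. */
void ref_saxpy(int n, float alpha,
               const float *x, int incx,
               float *y, int incy)
{
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; i++)
        y[(size_t)i * (size_t)incy] += alpha * x[(size_t)i * (size_t)incx];
}
```

With cblas.h, the equivalent call for contiguous arrays should be
cblas_saxpy(n, alpha, x, 1, y, 1).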
The duplication of the BLAS library happened because there was an
original BLAS attempt inherited from MacOS 9 that didn't implement the
whole thing and used slightly different names from the original
library. This was retained for backwards compatibility when the full
BLAS (ATLAS) was brought on board more recently.
Sorry about the lack of documentation. BLAS/LAPACK is fully
documented at netlib.org. You can even go out and buy a book on it.
However there does appear to be a hole in the documentation from the
perspective of the search engine.
Performance note: most linear algebra functions that operate on 1D
arrays like saxpy are not as efficient as one might imagine. Functions
like this with no possibility for data reuse are typically LSU bound
and don't saturate the arithmetic units. This is not a bus issue. It
happens in L1 cache too. To do one vec_madd() operation (which is what
saxpy does at its core) you have to do two loads and a store to move
data in and out of registers. (It would have been three loads, except
that one argument is a scalar.) Even if all the data is in L1 you'll
only be getting at most 1/3 of the vector floating point unit on some
CPUs. If your calculation is more complex than a simple saxpy
operation, you are likely better off writing your own code and merging
all those little operations together into one larger one. This should
eliminate a bunch of those loads and stores. Towards this end, MacSTL
might make your life easier. I haven't used it, but I dimly recall it
is supposed to do template based loop fusion, which is the sort of
thing you need to be able to stitch together many small operations and
get rid of those loads. In addition, we've announced a matrix multiply
function for vImage that might do what you want, if you are using this
for 3D geometry. It is probably also useful for array cross products
and array dot products against a single 3D vector, if you are creative.
It skips 0 elements in the matrix so should be efficient for these
smaller operations. As it is new for X.4, it might be interesting for
your next game, if not the current one.
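
To make the point about merging small operations concrete, here is a
scalar sketch. It assumes a made-up calculation, out = (a*x + y) * s,
and both function names are hypothetical. The unfused version makes two
passes and has to store and reload a temporary array; the fused version
does the same arithmetic in one pass and keeps the intermediate value in
a register.

```c
#include <stddef.h>

/* Two passes: tmp is written by the first loop and re-read by the second. */
void scale_saxpy_unfused(size_t n, float a, float s,
                         const float *x, const float *y,
                         float *tmp, float *out)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = a * x[i] + y[i];      /* 2 loads, 1 store per element */
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[i] * s;           /* 1 more load, 1 more store */
}

/* One pass: same result, but only 2 loads and 1 store per element. */
void scale_saxpy_fused(size_t n, float a, float s,
                       const float *x, const float *y, float *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (a * x[i] + y[i]) * s;
}
```

The fused loop does twice the arithmetic per element for the same
load/store traffic as a single saxpy, which is the effect you are after
when the load/store unit is the bottleneck.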
In general, vImage overhead is pretty small. Typically we check to see
if AltiVec is available, do some error checking which is maybe 4-5
conditionals, then branch to the scalar or vector code. That will then
read in data from the structs, set up a few constants and then get
moving. The overhead for handling multiple rows is in general just a
few instructions. There typically isn't anything to be done except
update a counter and advance two pointers. The following basic loop
architecture is used almost everywhere:
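    srcData = src->data;
    destData = dest->data;
    for( y = 0; y < dest->height; y++ )
    {
        srcRow = srcData;
        destRow = destData;
        for( x = 0; x < dest->width; x++ )
        {
            //process a pixel here
            destRow[x] = srcRow[x]; //this just copies data as an example
        }
        //advance to the next row (assumes srcData and destData are byte pointers)
        srcData += src->rowBytes;
        destData += dest->rowBytes;
    }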
As you can see, the only stuff that would not be done if we had a 1-D
function is the pointer arithmetic shown above. There is
typically one such add per buffer handled by the function, and a load
or two at the head of the function to read in the data from the struct.
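
A self-contained sketch of that row loop follows. The struct is a
stand-in that mirrors the fields used above (data, height, width,
rowBytes), and copy_rows is a hypothetical example function, not a
vImage call. It assumes one byte per pixel. A 1-D array is simply the
case height = 1.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in buffer description; an assumption for illustration only. */
typedef struct {
    void  *data;
    size_t height;    /* number of rows */
    size_t width;     /* pixels per row (one byte per pixel here) */
    size_t rowBytes;  /* distance in bytes between the starts of rows */
} ExampleBuffer;

/* Copies dest->height rows of dest->width bytes from src to dest. */
void copy_rows(const ExampleBuffer *src, const ExampleBuffer *dest)
{
    assert(src->height >= dest->height && src->width >= dest->width);

    const unsigned char *srcData = src->data;
    unsigned char *destData = dest->data;

    for (size_t y = 0; y < dest->height; y++) {
        for (size_t x = 0; x < dest->width; x++)
            destData[x] = srcData[x];
        /* The only per-row overhead: one pointer add per buffer. */
        srcData  += src->rowBytes;
        destData += dest->rowBytes;
    }
}
```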
Some functions do have a large amount of setup overhead. Typically
these have a separate function associated with them that does the setup
for them and dumps the data into an opaque data structure. This
structure is then passed into the real function each time it is called.
To keep the overhead down to a one-time cost, you can reuse this
structure repeatedly. We've been trying to keep vImage usable in
real-time situations. If we did our job correctly, there should always
be a low-latency non-blocking path to process data quickly.
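
The general shape of that setup-once, call-many pattern can be sketched
as below. Everything here is hypothetical (the names, the lookup-table
contents, the signatures); it is not the vImage API, just an
illustration of paying the setup cost one time and keeping the per-call
path free of allocation. In a real library the struct definition would
be hidden from the caller.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical setup object holding precomputed data (a 256-entry table). */
typedef struct ExampleSetup {
    unsigned char table[256];
} ExampleSetup;

/* One-time, comparatively expensive step: allocate and fill the table.
   The table here is a simple inversion, chosen only as an example. */
ExampleSetup *example_create_setup(void)
{
    ExampleSetup *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    for (int i = 0; i < 256; i++)
        s->table[i] = (unsigned char)(255 - i);
    return s;
}

/* Per-call step: no allocation and no setup, just table lookups. */
void example_apply(const ExampleSetup *s,
                   const unsigned char *src, unsigned char *dest, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dest[i] = s->table[src[i]];
}

/* Release the setup object once it is no longer needed. */
void example_destroy_setup(ExampleSetup *s)
{
    free(s);
}
```

A caller would create the setup object once at startup, call
example_apply() as often as needed, and destroy the object at shutdown.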
The remaining set of functions that have a more involved setup are
typically doing some tiling internally either for better cache reuse or
for multithreading or both. You can turn that off using
kvImageDoNotTile if you know your datasets are small. In this case,
you'll bounce back to the fast path single threaded case that operates
on the whole buffer and we won't waste time trying to figure out optimum
tile sizes or waiting for other threads to wake up, do their work, and
complete.
Because of the limitations of the vector engine (up to 8-cycle
pipelines), scanlines narrower than 128 bytes might fall back to a
less efficient vector code path or a scalar code path. In the absence
of time to do competitive benchmarks, I would say that is about the
cutoff below which using vImage on 1D arrays is possibly no longer a
large win.