Mailing Lists: Apple Mailing Lists


Re: Accelerate library




On Oct 26, 2004, at 7:54 PM, John Stiles wrote:



Well, I spent quite a while on ADC and came up empty-handed.
I actually wrote my own routine that did the job, and generated really similar code. But I would rather call out to an Apple routine instead of making my own, since they have the opportunity to super-fine-tune it or redesign it when a new CPU type comes out.
Honestly, I was also hoping that vSaxpy would work on G3s as well, sparing me the effort of writing separate code for G3s and G4s. I was under the impression that Accelerate would work on G3s, just more slowly. But looking at the implementation in gdb, it sure looks like it requires a G4 just to get off the ground. (Unless Mach loads in a different library entirely for G3s...??)

Mach-O supports split files with multiple forks for different architectures or CPUs. Almost nobody uses this feature at the moment, but this is subject to change. You can use the command-line "file" command to see whether a file is split into multiple forks, and which forks are present. If you look today, you will find that vecLib is not split into multiple forks. That is also subject to change.


saxpy is part of BLAS, which is an industry standard linear algebra package that has been around for quite some time. The version in cblas.h should work fine on G3. I think there are a number of things that would work for you:

vDSP.h:  vsmul() and vadd()                (works on G3 and G4)
cblas.h:    cblas_saxpy()                    (works on G3 and G4)
vBLAS.h:    SAXPY()
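
For reference, here is a minimal plain C sketch of what a saxpy call computes (y = alpha * x + y). The name ref_saxpy is hypothetical and is not part of any Apple header; the argument order follows cblas_saxpy(N, alpha, X, incX, Y, incY), and this sketch assumes positive strides only. It is meant to show the semantics, not how the library implements it.

```c
#include <assert.h>
#include <stddef.h>

/* Reference saxpy: y[i*incY] = alpha * x[i*incX] + y[i*incY], for i in [0, n).
   ref_saxpy is a hypothetical name used for illustration only.
   Assumption: strides are positive (negative strides are not handled here). */
void ref_saxpy(int n, float alpha,
               const float *x, int incX,
               float *y, int incY)
{
    assert(incX > 0 && incY > 0);

    for (int i = 0; i < n; i++) {
        /* Two loads (x and y), one multiply-add, one store per element. */
        y[(size_t)i * (size_t)incY] += alpha * x[(size_t)i * (size_t)incX];
    }
}
```

With x = {1, 2, 3, 4}, y = {10, 20, 30, 40} and alpha = 2, the call ref_saxpy(4, 2.0f, x, 1, y, 1) leaves y holding {12, 24, 36, 48}.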

The duplication of the BLAS library happened because there was an original BLAS attempt inherited from Mac OS 9 that didn't implement the whole thing and used slightly different names from the original library. This was retained for backwards compatibility when the full BLAS (ATLAS) was brought on board more recently.

Sorry about the lack of documentation. BLAS/LAPACK is fully documented at netlib.org. You can even go out and buy a book on it. However, there does appear to be a hole in the documentation from the perspective of the search engine.

Performance note: most linear algebra functions that operate on 1D arrays, like saxpy, are not as efficient as one might imagine. Functions like this, with no possibility for data reuse, are typically LSU bound and don't saturate the arithmetic units. This is not a bus issue. It happens in L1 cache too. To do one vec_madd() operation (which is what saxpy does at its core) you have to do two loads and a store to move data in and out of registers. (It would have been three loads, except that one argument is a scalar.) Even if all the data is in L1, you'll only be getting at most 1/3 of the vector floating point unit on some CPUs.

If your calculation is more complex than a simple saxpy operation, you are likely better off writing your own code and merging all those little operations together into one larger one. This should eliminate a bunch of those loads and stores. To this end, MacSTL might make your life easier. I haven't used it, but I dimly recall it is supposed to do template-based loop fusion, which is the sort of thing you need to be able to stitch together many small operations and get rid of those loads.

In addition, we've announced a matrix multiply function for vImage that might do what you want, if you are using this for 3D geometry. It is probably also useful for array cross products and array dot products against a single 3D vector, if you are creative. It skips 0 elements in the matrix, so it should be efficient for these smaller operations. As it is new for X.4, it might be interesting for your next game, if not the current one.
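
To make the point about merging small operations concrete, here is a rough scalar C sketch. Both function names are hypothetical, and the computation (a saxpy-style multiply-add followed by a scale) is just an assumed example of a "more complex" calculation. The two-pass version stores every intermediate value to memory and loads it back; the fused version keeps it in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Two separate passes, as you would get by chaining two library calls.
   Every element of tmp is stored by the first loop and loaded again
   by the second. Hypothetical name, for illustration only. */
void scale_sum_two_pass(size_t n, float a, const float *x, const float *y,
                        float s, float *tmp, float *out)
{
    assert(x && y && tmp && out);

    for (size_t i = 0; i < n; i++)
        tmp[i] = a * x[i] + y[i];   /* saxpy-like pass: 2 loads, 1 store */

    for (size_t i = 0; i < n; i++)
        out[i] = s * tmp[i];        /* scaling pass: 1 load, 1 store */
}

/* Fused version: same result, but the intermediate never leaves a
   register, so each element costs 2 loads and 1 store in total. */
void scale_sum_fused(size_t n, float a, const float *x, const float *y,
                     float s, float *out)
{
    assert(x && y && out);

    for (size_t i = 0; i < n; i++) {
        float t = a * x[i] + y[i];
        out[i] = s * t;
    }
}
```

The fused loop does the same arithmetic with three memory operations per element instead of five, which is the kind of saving hand-merging (or template-based loop fusion) is after.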

In general, vImage overhead is pretty small. Typically we check to see if AltiVec is available, do some error checking which is maybe 4-5 conditionals, then branch to the scalar or vector code. That will then read in data from the structs, set up a few constants and then get moving. The overhead for handling multiple rows is in general just a few instructions. There typically isn't anything to be done except update a counter and advance two pointers. The following basic loop architecture is used almost everywhere:

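    srcData = src->data;            //byte pointers to the first row
    destData = dest->data;
    srcRowBytes = src->rowBytes;    //row strides in bytes, read from the structs
    destRowBytes = dest->rowBytes;

    for( y = 0; y < dest->height; y++ )
    {
        srcRow = srcData;
        destRow = destData;

        for( x = 0; x < dest->width; x++ )
        {
            //process a pixel here
            destRow[x] = srcRow[x]; //this just copies data as an example
        }

        srcData += srcRowBytes;     //advance both pointers to the next row
        destData += destRowBytes;
    }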

As you can see, the only work that would not be done if we had a 1-D function is the pointer arithmetic shown above. There is typically one such add per buffer handled by the function, and a load or two at the head of the function to read in the data from the struct.
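
A self-contained version of that loop might look like the sketch below. The PlanarBuffer struct and copy_planar8 function are hypothetical stand-ins: the struct simply mirrors the data / height / width / rowBytes layout of a vImage_Buffer so the example compiles without the framework, and 8-bit single-channel pixels are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in with the same fields as a vImage_Buffer. */
typedef struct {
    void   *data;      /* pointer to the top-left pixel */
    size_t  height;    /* number of rows */
    size_t  width;     /* pixels per row */
    size_t  rowBytes;  /* bytes from one row start to the next (may include padding) */
} PlanarBuffer;

/* Copies dest->height rows of dest->width 8-bit pixels from src to dest.
   Returns 0 on success, -1 if src is smaller than dest. */
int copy_planar8(const PlanarBuffer *src, const PlanarBuffer *dest)
{
    assert(src && dest && src->data && dest->data);
    if (src->width < dest->width || src->height < dest->height)
        return -1;

    const uint8_t *srcData  = src->data;
    uint8_t       *destData = dest->data;

    for (size_t y = 0; y < dest->height; y++) {
        const uint8_t *srcRow  = srcData;
        uint8_t       *destRow = destData;

        for (size_t x = 0; x < dest->width; x++)
            destRow[x] = srcRow[x];     /* per-pixel work goes here */

        /* The only per-row overhead: one pointer add per buffer. */
        srcData  += src->rowBytes;
        destData += dest->rowBytes;
    }
    return 0;
}
```

Because each buffer carries its own rowBytes, the source and destination can have different row padding, and the padding bytes are never touched.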

Some functions do have a large amount of setup overhead. Typically these have a separate function associated with them that does the setup for them and dumps the data into an opaque data structure. This structure is then passed into the real function each time it is called. To keep the overhead down to a one-time cost, you can reuse this structure repeatedly. We've been trying to keep vImage usable in real-time situations. If we did our job correctly, there should always be a low-latency non-blocking path to process data quickly.
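
The setup-once, reuse-many-times pattern described above can be sketched in plain C as follows. Everything here is hypothetical (the BlurSetup type, the function names, and the trivial box filter standing in for expensive setup); it is not a vImage API, just an illustration of keeping allocation and precomputation out of the per-call path.

```c
#include <assert.h>
#include <stdlib.h>

/* Callers treat BlurSetup as opaque; in a real library the struct body
   would live in a private source file. All names are hypothetical. */
typedef struct BlurSetup BlurSetup;
struct BlurSetup {
    size_t taps;
    float *weights;
};

/* One-time setup: allocates and precomputes the filter weights. */
BlurSetup *blur_setup_create(size_t taps)
{
    assert(taps > 0);
    BlurSetup *s = malloc(sizeof *s);
    if (!s)
        return NULL;
    s->weights = malloc(taps * sizeof *s->weights);
    if (!s->weights) {
        free(s);
        return NULL;
    }
    s->taps = taps;
    for (size_t k = 0; k < taps; k++)
        s->weights[k] = 1.0f / (float)taps;   /* simple box filter */
    return s;
}

/* Per-call work: no allocation, no setup. Writes nOut outputs;
   `in` must hold at least nOut + taps - 1 elements. */
void blur_apply(const BlurSetup *s, const float *in, float *out, size_t nOut)
{
    assert(s && in && out);
    for (size_t i = 0; i < nOut; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < s->taps; k++)
            acc += s->weights[k] * in[i + k];
        out[i] = acc;
    }
}

void blur_setup_destroy(BlurSetup *s)
{
    if (s) {
        free(s->weights);
        free(s);
    }
}
```

A caller would create the setup object once, call blur_apply() on every frame or buffer, and destroy it at shutdown, so the malloc and weight computation never land on the time-critical path.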

The remaining set of functions that have a more involved setup are typically doing some tiling internally, either for better cache reuse or for multithreading or both. You can turn that off using kvImageDoNotTile if you know your datasets are small. In this case, you'll bounce back to the fast-path single-threaded case that operates on the whole buffer, and we won't waste time trying to figure out optimum tile sizes or waiting for other threads to wake up, do their work, and complete.

Because of the limitations of the vector engine (up to 8-cycle pipelines), scanlines narrower than 128 bytes might fall back to a less efficient vector code path or a scalar code path. In the absence of time to do competitive benchmarks, I would say that is about the cutoff below which using vImage on 1D arrays is possibly no longer a large win.

Ian




PerfOptimization-dev mailing list      (email@hidden)




Copyright © 2007 Apple Inc. All rights reserved.