Mailing Lists: Apple Mailing Lists


[apple scitech] Re: Accelerated Cartesian Vector Struct (1)





From: Daniel J Farrell <email@hidden>

I am having another 'try the accelerate framework' desire/bug today.
This happens every so often. However, I normally give up because I
can't find any example code that is simple enough for me to follow (I
find the learning curve a bit steep). Can anybody here help me apply
the accelerate framework to the simple example below? It is a
Cartesian vector struct with one operation which finds the length of
the vector.

Regards,

Daniel.


<Note: msg comes in three parts to make list-mom happy>

The Accelerate framework doesn't have good coverage for 3D geometry (except for 4x4 matrix multiplication) in quite the form I imagine you'd like it to be. As a general rule, we try not to provide interfaces that lead people into performance pit traps.* Some of these things are quite subtle and difficult to see until you fall into them. So, in some cases what looks like missing API is really gentle encouragement to go in another direction. I'm sure that the vast majority of omissions in Accelerate framework are due to our own laz^H^H^H limited engineering resources, but in this particular case, we've actually talked about it and decided to focus on other areas in Accelerate.framework more likely to deliver customer impact. At least for now...

Why?

Most of the time, people imagine that SIMD vectors will be naturally good at working with 3D vectors -- they are both small vectors, right? Unfortunately, 3D problems are often too small to benefit much from library calls, and are usually "phrased" in a way that makes vectorization with SIMD very hard and often highly inefficient.

As to the first point, it takes about 6 (throughput) cycles to navigate the dyld stub to a dynamic library like the core of Accelerate.framework, which is a lot for a problem that might involve just a few multiplies and adds, like a dot product. The function call also makes it difficult to get good concurrency. We can hope that the processor's reorder buffer is able to make some sense of multiple small function calls and do some of the work in parallel, but it would be a lot better if the function were just inlined and the work interleaved somewhat. It takes a lot of extra instructions to save volatile registers, write operands to the stack, (call the function), read the results back off the x87 stack, and reload volatiles. Those instructions fill up reorder buffer slots with junk, slots that could otherwise be used to parallelize the real work you are looking to do. There is also the question of whether popping in and out of a function call over and over again will cause false (or true) memory aliasing stalls (e.g. function arguments on the stack colliding at the same address across function calls) that might enforce some serialization beyond what you would expect just looking at your source at the C level. That could prevent the reorder buffer from delivering much parallelism between function calls. You can't inline a function from a dynamic library, so unless the compiler inserts the code for you or we stick the code in our headers, you need to roll your own to get it to perform satisfactorily, at least for small problems.

As to the second point, legacy 3D data layouts are highly inconvenient for vector units to use. SIMD vector units (like AltiVec/SSE) are typically designed to largely treat the different elements in the vector the same. In 3D data, the x,y,z and w components that share the same SIMD vector in an interleaved data format often aren't treated the same in arithmetic, especially for things like cross products and quaternions. So, such data formats make for a lot of permute activity in the vector unit, which can take time away from the arithmetic you need to do, especially on Intel. You're at risk of spending most of your time rearranging data in the vector rather than doing actual work. Complex (real+imaginary) data types have similar problems. For further reading on this problem:
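As a rough illustration of the layout problem, the sketch below contrasts the interleaved format with a planar one and writes a cross product against the planar form. The type names PointAoS and PointsSoA and the function cross_soa are hypothetical, and whether a given compiler actually vectorizes the loop depends on the compiler, flags, and aliasing information, so treat it as a sketch rather than a tuned implementation.

```c
#include <stddef.h>

/* Interleaved ("array of structures") layout: the x, y, z of one point sit
   side by side, so a SIMD register loaded from it holds mixed components. */
typedef struct {
    float x, y, z;
} PointAoS;

/* Planar ("structure of arrays") layout: each array holds one component
   for many points. Hypothetical type, for illustration only. */
typedef struct {
    float *x;
    float *y;
    float *z;
} PointsSoA;

/* out[i] = a[i] cross b[i] for n points. Every index i gets exactly the
   same arithmetic, so several i can share one SIMD instruction with no
   shuffling of elements inside a register. */
static void cross_soa(const PointsSoA *a, const PointsSoA *b,
                      PointsSoA *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out->x[i] = a->y[i] * b->z[i] - a->z[i] * b->y[i];
        out->y[i] = a->z[i] * b->x[i] - a->x[i] * b->z[i];
        out->z[i] = a->x[i] * b->y[i] - a->y[i] * b->x[i];
    }
}
```

With PointAoS data, the same cross product on a single point needs the y, z, x components lined up against z, x, y, which is where the permute traffic described above comes from.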

http://developer.apple.com/hardwaredrivers/ve/simd.html


Finally, these 3D problems often don't work with enough data to keep the vector unit busy, so you end up with a lot of pipeline bubbles -- wasted opportunities to do work. Typically you need at least 4-8 vectors' worth of data to keep the vector engine busy. A single vertex in 3-space is less than one vector's worth of data. With scalar code, there is usually at least some instruction-level parallelism around {i,j,k}/{x,y,z} at that size, meaning that pipelines run fuller. With vector code, you give that up, since you are trying to do "everything" in a single instruction. So, in switching to vector code, you in effect trade the instruction-level parallelism you already had in your scalar code for the explicit parallelism in the vector unit. That swap is often a performance wash when you are data-starved. Unfortunately, in the process, you buy yourself some permute headaches, which can turn a performance wash into a performance loss. In such cases (common for small dot products), scalar code can actually be faster than vector code.

....


Copyright © 2007 Apple Inc. All rights reserved.