
Re: [apple scitech] Re: Accelerated Cartesian Vector Struct (2)



What's worked fairly well for me is defining two data structures along these lines:

struct Cart3 { float x, y, z; };
struct VCart3 { vFloat x, y, z; };

Then I have a packVCart3() routine which takes an array of Cart3 and repackages it as an array of VCart3. In other words, it takes each group of four Cart3 elements (12 floats) and shuffles them around into one VCart3 element (three vFloats).

Before: x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3   x4 y4 z4...
        |------------ group 0 ------------|   |--- group 1 --->
After:  x0 x1 x2 x3 y0 y1 y2 y3 z0 z1 z2 z3   x4 x5 x6 x7...

(It also pads out the last group with zeros if necessary to form a multiple of four.)
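
For illustration, here is a minimal sketch of what such a packing routine might look like. The original packVCart3() isn't shown, so the signature and the vcart3Count() helper are assumptions. vFloat is replaced by a GCC/Clang vector-extension stand-in so the sketch builds without Accelerate, and the shuffle is done lane by lane in scalar code rather than with shufps/vperm.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Accelerate's vFloat (an assumption, GCC/Clang only). */
typedef float vFloat __attribute__ ((vector_size (16)));

struct Cart3  { float  x, y, z; };
struct VCart3 { vFloat x, y, z; };

/* Hypothetical helper: number of VCart3 groups needed for n points. */
size_t vcart3Count( size_t n )
{
	return (n + 3) / 4;
}

/* Hypothetical packVCart3: dst must have room for vcart3Count(n) elements.
   Lanes past the end of src are filled with zeros. */
void packVCart3( struct VCart3 *dst, const struct Cart3 *src, size_t n )
{
	size_t g, lane;

	assert( dst != NULL && (src != NULL || n == 0) );

	for( g = 0; g < vcart3Count( n ); g++ )
	{
		for( lane = 0; lane < 4; lane++ )
		{
			size_t k = 4 * g + lane;	/* index of the source point */

			dst[g].x[lane] = k < n ? src[k].x : 0.0f;
			dst[g].y[lane] = k < n ? src[k].y : 0.0f;
			dst[g].z[lane] = k < n ? src[k].z : 0.0f;
		}
	}
}
```

With five input points this fills two VCart3 elements, the second holding one real point and three lanes of zero padding.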

The nice thing about this arrangement is that you don't have to change your scalar code all that much to get it vectorized. You're mostly just substituting VCart3 for Cart3, using the vector library function counterparts from Accelerate, and reducing your loop iterations by a factor of four. There is, of course, a cost in shuffling your arrays around like this (each grouping can be rearranged by 6 shufps/vperm ops, btw), but in cases I have seen, the pre-shuffling loop is typically O(N) while the calculation loop that follows is O(N^2) or higher, so it doesn't really matter.
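
To make the "substitute VCart3 for Cart3" point concrete, here is a sketch of a scalar loop next to a vector counterpart. dotScalar() and dotVector() are hypothetical names chosen for this example, and vFloat is a GCC/Clang vector-extension stand-in for the Accelerate type, so treat this as an illustration rather than production code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Accelerate's vFloat (an assumption, GCC/Clang only). */
typedef float vFloat __attribute__ ((vector_size (16)));

struct Cart3  { float  x, y, z; };
struct VCart3 { vFloat x, y, z; };

/* Scalar version: one dot product per iteration. */
void dotScalar( float *out, const struct Cart3 *a, const struct Cart3 *b, size_t n )
{
	size_t i;

	assert( out != NULL || n == 0 );
	for( i = 0; i < n; i++ )
		out[i] = a[i].x * b[i].x + a[i].y * b[i].y + a[i].z * b[i].z;
}

/* Vector version: the loop body is unchanged, but each iteration now
   produces four dot products, so it runs over groups = (n + 3) / 4. */
void dotVector( vFloat *out, const struct VCart3 *a, const struct VCart3 *b, size_t groups )
{
	size_t i;

	assert( out != NULL || groups == 0 );
	for( i = 0; i < groups; i++ )
		out[i] = a[i].x * b[i].x + a[i].y * b[i].y + a[i].z * b[i].z;
}
```

Only the types and the trip count differ between the two functions, which is the appeal of keeping the data in this layout.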

Be sure to run Shark and figure out where you need to vectorize before you dive into this. Last week, I boosted the speed of a modelling program by a factor of three after touching only two functions, so it can really save you a lot of effort.

-Ted

On 25-Nov-07, at 4:33 AM, Ian Ollmann wrote:


< part 2 of 3 >

This doesn't mean that 3D geometry can't be accelerated efficiently in the vector unit -- even small dot products! Naturally scaling by 4x (or better!) is easily possible. You "just" need to organize your code to take advantage of economies of scale. That is, process a bunch of vertices at once. Unfortunately, that usually means taking a giant wrecking ball to your application core data structures in order to solve the structural problems that are holding the vector unit back.

Namely, replace packed data structures such as this:

/*
	NON-ACCELERATED EXAMPLE

	A Cartesian vector structure and convenience make function.
	Let's also add a simple function to calculate the length
	of the vector.
*/
typedef struct _Vector
{
	float i;
	float j;
	float k;
} Vector;

...with planar array representations something like this:

#define kMyVectorSize	16	/* should be a multiple of 4 */

typedef union VectorOfVertices
{
	struct
	{
		float	i[ kMyVectorSize ]		__attribute__ ((__aligned__ (16)));
		float	j[ kMyVectorSize ]		__attribute__ ((__aligned__ (16)));
		float	k[ kMyVectorSize ]		__attribute__ ((__aligned__ (16)));
	};
	struct
	{
		vFloat vi[ kMyVectorSize/4];
		vFloat vj[ kMyVectorSize/4];
		vFloat vk[ kMyVectorSize/4];
	};
} VectorOfVertices;

This means grouping many vertices together in the same structure. This can cause its own problems in some cases. For example, common optimizations like just calculating points that fall in the view frustum might have to be thrown out. You'll need to proceed judiciously here. On the other hand, it can often do wonderful things for your cache organization (cf. Judy trees), if you can identify sets of vertices that "go together", for example, the set of vertices in an avatar's leg, or 5 consecutive amino acids in a protein. These are likely to be found near each other, and are therefore likely subject to similar sets of operations, so can usually be treated as a single unit.
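
As a rough sketch of how the planar union might be used: fill it through the float view, then compute through the vFloat view. LoadVertices() and SquaredLength4() are hypothetical helpers, not part of the original post. vFloat is a GCC/Clang vector-extension stand-in for the Accelerate type, and the alignment attributes are left out because the vFloat members already give the union 16-byte alignment here.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Accelerate's vFloat (an assumption, GCC/Clang only). */
typedef float vFloat __attribute__ ((vector_size (16)));

#define kMyVectorSize	16	/* should be a multiple of 4 */

typedef struct _Vector { float i, j, k; } Vector;

typedef union VectorOfVertices
{
	struct { float  i[ kMyVectorSize ], j[ kMyVectorSize ], k[ kMyVectorSize ]; };
	struct { vFloat vi[ kMyVectorSize/4 ], vj[ kMyVectorSize/4 ], vk[ kMyVectorSize/4 ]; };
} VectorOfVertices;

/* Hypothetical helper: copy up to kMyVectorSize packed vertices into the
   planar arrays, zero-filling any unused slots. */
void LoadVertices( VectorOfVertices *dst, const Vector *src, size_t count )
{
	size_t n;

	assert( dst != NULL && count <= kMyVectorSize );
	for( n = 0; n < kMyVectorSize; n++ )
	{
		dst->i[n] = n < count ? src[n].i : 0.0f;
		dst->j[n] = n < count ? src[n].j : 0.0f;
		dst->k[n] = n < count ? src[n].k : 0.0f;
	}
}

/* Hypothetical helper: squared lengths of vertices 4*g .. 4*g+3,
   read through the vFloat view of the same storage. */
vFloat SquaredLength4( const VectorOfVertices *v, size_t g )
{
	assert( v != NULL && g < kMyVectorSize/4 );
	return v->vi[g] * v->vi[g] + v->vj[g] * v->vj[g] + v->vk[g] * v->vk[g];
}
```

The zero fill means the unused slots produce harmless zeros rather than garbage when a whole vFloat is processed.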

In any case, once you have your {i,j,k} or {x,y,z} or what-have-yous in separate arrays, the vector arithmetic starts to look a lot like the scalar arithmetic done wider, and should speed up by approximately a factor of 4 on G5/Core 2.

#include <Accelerate/Accelerate.h>

// Calculate the distance of 4 vertices from the origin
vFloat VectorLength( vFloat vi, vFloat vj, vFloat vk )
{
	return vsqrtf( vi * vi + vj * vj + vk * vk );
}

or maybe like this for more than four vertices at a time (usually somewhat more efficient):


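void VectorLength( vFloat * restrict results, const vFloat * restrict vi, const vFloat * restrict vj, const vFloat * restrict vk, int vec128Count )
{
	int i;

	/* sum of squares, four vertices per iteration */
	for( i = 0; i < vec128Count; i++ )
		results[i] = vi[i] * vi[i] + vj[i] * vj[i] + vk[i] * vk[i];

	/* square roots of all the floats in one call, in place;
	   vvsqrtf works on float arrays, hence the casts */
	i = vec128Count * 4;
	vvsqrtf( (float *) results, (const float *) results, &i );
}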

< to be continued >
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Scitech mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/scitech/email@hidden

This email sent to email@hidden

________________________________________________________________
LAMONTAGNE GEOPHYSICS LTD / GEOPHYSIQUE LTEE
115 Grant Timmins Dr.
Kingston ON Canada K7M 8N3






Copyright © 2007 Apple Inc. All rights reserved.