
RE: AltiVec (small example)



I have been recoding some wavelet image-processing techniques and have seen
speed-ups of 5-10x by using vecLib.

One comment: in many cases, achieving this requires looking at how the
algorithm is implemented and restructuring it to use AltiVec effectively.
For example, if you are doing orthogonal wavelet transforms on the
rows and columns of an image, it's faster to do the rows, transpose the
image, do the columns (which are now the rows), and transpose again. This
keeps the stride at 1, which lets AltiVec work on contiguous data.
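
The rows / transpose / rows / transpose pattern can be sketched in portable C
as follows. This is only an illustration under stated assumptions: an
unnormalized Haar step stands in for the actual wavelet filter, the image is
row-major with even dimensions, and the names haar_row, transpose and
haar_2d_via_transpose are hypothetical, not part of vecLib.

```c
#include <assert.h>
#include <stddef.h>

/* One unnormalized Haar step on a contiguous row of even length n:
   pairwise averages go to the first half, differences to the second.
   (Stand-in for the real wavelet filter; every access is stride 1 or 2.) */
static void haar_row(const float *in, float *out, size_t n)
{
    size_t half = n / 2;
    for (size_t k = 0; k < half; ++k) {
        out[k]        = 0.5f * (in[2 * k] + in[2 * k + 1]);
        out[half + k] = 0.5f * (in[2 * k] - in[2 * k + 1]);
    }
}

/* Out-of-place transpose of a rows x cols row-major matrix. */
static void transpose(const float *src, float *dst, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[c * rows + r] = src[r * cols + c];
}

/* Transform rows, transpose, transform the former columns as rows,
   transpose back.  img is rows x cols; tmp is scratch of the same size.
   The result is left in img. */
void haar_2d_via_transpose(float *img, float *tmp, size_t rows, size_t cols)
{
    assert(rows % 2 == 0 && cols % 2 == 0);

    for (size_t r = 0; r < rows; ++r)          /* row pass, contiguous */
        haar_row(img + r * cols, tmp + r * cols, cols);
    transpose(tmp, img, rows, cols);           /* img is now cols x rows */

    for (size_t c = 0; c < cols; ++c)          /* column pass, also contiguous */
        haar_row(img + c * rows, tmp + c * rows, rows);
    transpose(tmp, img, cols, rows);           /* back to rows x cols */
}
```

The two transposes cost extra passes over the image, so whether this wins
depends on the image size and on how well the row kernel vectorizes.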

Also, version 3.5 of GCC supposedly does automatic vectorization, at least
in some cases.  This will be part of the next release of Xcode, but is
available now from open-source sites.


-----Original Message-----
From: scitech-bounces+chris.bevis=email@hidden
[mailto:scitech-bounces+chris.bevis=email@hidden] On
Behalf Of Kyros Yakinthos
Sent: Friday, September 17, 2004 12:20 AM
To: Apple Scitech Mailing List
Subject: Re: AltiVec (small example)



On 16 Sep 2004, at 17:54, Brendan Younger wrote:

>
> // NOTE: I typed this directly into my mail client; it should work, 
> but I haven't tested it.
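> void my_vec_add(float *a, float *b, float *c, int *n) {
> 	/* vec_ld() and vec_st() take a byte offset, so the loop bound is the
> 	   array size in bytes.  Assumes 16-byte-aligned arrays and *n a
> 	   multiple of 4. */
> 	int i, bytes = *n * (int)sizeof(float);
>
> 	for(i=0; i < bytes; i += 16) {
> 		vector float	vec_a, vec_b, vec_sum, vec_c;
>
> 		vec_a = vec_ld(i, a);	/* load 4 floats from a */
> 		vec_b = vec_ld(i, b);	/* load 4 floats from b */
>
> 		vec_sum = vec_add(vec_a, vec_b);	/* a + b */
> 		vec_c = vec_add(vec_sum, vec_sum);	/* 2 * (a + b) */
>
> 		vec_st(vec_c, i, c);	/* store 4 floats to c */
> 	}
> }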
>
> The relevant changes are:
> 1. Using vec_ld() explicitly rather than dereferencing the pointers.  
> At least on earlier versions of GCC, I've found that this helps the 
> compiler to unroll the loop and optimize a bit more.
> 2. Using only two vec_add()'s.  In performance-critical code, you 
> never, ever want to use a suboptimal algorithm.
> 3. Miscellaneous improvements like taking out the "inline" and the 
> #define's, neither of which is necessary or particularly good form.
>
>
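
For checking a vectorized routine like the one quoted above, a plain scalar
version of the same arithmetic, c[i] = 2 * (a[i] + b[i]), is handy as a
reference. The sketch below is portable C; ref_vec_add and max_abs_diff are
hypothetical names introduced here for illustration, and the length is passed
by pointer only to mirror the Fortran-callable signature.

```c
#include <assert.h>

/* Scalar reference for the quoted routine: c[i] = 2 * (a[i] + b[i]).
   Works for any length and any alignment. */
void ref_vec_add(const float *a, const float *b, float *c, const int *n)
{
    assert(a && b && c && n);
    for (int i = 0; i < *n; ++i) {
        float sum = a[i] + b[i];
        c[i] = sum + sum;
    }
}

/* Largest absolute element-wise difference between x and y. */
float max_abs_diff(const float *x, const float *y, int n)
{
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Running both versions on the same input and comparing with max_abs_diff over
the whole array shows quickly whether the vector loop covers every element.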

OK, this version of the C subroutine provided by Brendan
showed enormous speedups!
I even saw a 7x speedup on my Xserve G5.
I saw the same behavior on my G4 (always using xlf).
I think the critical point was indeed the vec_ld() call.
As a Fortran programmer, I will use this C subroutine structure
as a baseline to construct my various calls for simple operations.
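
A possible shape for such a baseline, sketched in portable C: every argument
arrives by pointer (as Fortran passes them), the main loop walks the arrays
four floats at a time (one 16-byte AltiVec register), and a scalar loop
handles any leftover elements. The routine name my_vec_scale_add and the
operation y = alpha*x + y are hypothetical examples; the inner block is where
the AltiVec load, arithmetic and store calls would go on PowerPC, and the
external symbol name Fortran expects depends on the compiler.

```c
#include <assert.h>

enum { VEC_FLOATS = 4 };   /* floats per 16-byte vector register */

/* Fortran-callable pattern: y[i] = alpha * x[i] + y[i] for i in [0, *n). */
void my_vec_scale_add(const float *alpha, const float *x, float *y,
                      const int *n)
{
    assert(alpha && x && y && n);

    int length  = *n;
    int blocked = length - (length % VEC_FLOATS);  /* multiple of 4 */
    float s = *alpha;
    int i;

    /* Main loop: one vector-sized block per iteration.  On PowerPC this
       inner body is the part that would be replaced by vector intrinsics. */
    for (i = 0; i < blocked; i += VEC_FLOATS) {
        for (int k = 0; k < VEC_FLOATS; ++k)
            y[i + k] = s * x[i + k] + y[i + k];
    }

    /* Scalar tail for lengths that are not a multiple of 4. */
    for (; i < length; ++i)
        y[i] = s * x[i] + y[i];
}
```

Keeping the tail loop separate means the caller does not have to pad arrays
to a multiple of four elements.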
Thank you Brendan,

Kyros


 _______________________________________________
Do not post admin requests to the list. They will be ignored.
Scitech mailing list      (email@hidden)
Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/scitech/email@hidden

This email sent to email@hidden

Copyright © 2007 Apple Inc. All rights reserved.