
Re: Fastest way to change sign of the odd elements of vSInt16



Thank you for your input, Holger, but I think that in this particular case it would not be wise to "separate" the interleaved data into uniform vectors.
The code I am optimizing does the following:


result.re += pFilter->re * pData->re + pFilter->im * pData->im; //(calculation one)
result.im += pFilter->re * pData->im - pFilter->im * pData->re; // (calculation two)

The real and imaginary parts are interleaved 16-bit values; the result is 32-bit.
Currently, I am using the following approach:
1. Use vec_msum on the Filter and Data vectors (calculation one). This lets me do 8 multiplications and 4 additions with one instruction, and I get the 32-bit result for free.
2. Swap the real and imaginary parts in one vector with vec_rl.
3. Invert the signs of the odd elements of the swapped vector (now two instructions, vec_sub and vec_sel).
4. Use vec_msum again (calculation two). Again, 8 multiplications and 4 additions with one instruction.
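
For readers without AltiVec hardware, the four steps above can be modelled in plain scalar C. This is only a sketch of the data flow, not the vector code itself: the names cplx16, cplx32, msum_lane and cmac_msum_model are made up for illustration, and msum_lane imitates a single 32-bit lane of vec_msum (two 16-bit products plus the accumulator, with modulo wrap-around).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; they are not from the original post. */
typedef struct { int16_t re, im; } cplx16;
typedef struct { int32_t re, im; } cplx32;

/* Scalar model of one 32-bit lane of vec_msum:
 * a[0]*b[0] + a[1]*b[1] + acc, wrapping modulo 2^32.
 * The sum is formed in unsigned arithmetic to avoid signed overflow;
 * the final cast back to int32_t wraps on mainstream compilers. */
static int32_t msum_lane(const int16_t a[2], const int16_t b[2], int32_t acc)
{
    uint32_t sum = (uint32_t)acc;
    sum += (uint32_t)((int32_t)a[0] * (int32_t)b[0]);
    sum += (uint32_t)((int32_t)a[1] * (int32_t)b[1]);
    return (int32_t)sum;
}

/* Accumulate calculation one and calculation two over n complex samples. */
static cplx32 cmac_msum_model(const cplx16 *filter, const cplx16 *data, size_t n)
{
    cplx32 result = { 0, 0 };
    for (size_t i = 0; i < n; ++i) {
        const int16_t f[2] = { filter[i].re, filter[i].im };
        const int16_t d[2] = { data[i].re, data[i].im };

        /* Step 1: multiply-sum gives f.re*d.re + f.im*d.im. */
        result.re = msum_lane(f, d, result.re);

        /* Step 2: swap re/im (the vector code rotates each 32-bit word by 16). */
        int16_t s[2] = { d[1], d[0] };

        /* Step 3: negate the odd element by subtracting it from zero.
         * As in the modulo vector subtract, -32768 stays -32768. */
        s[1] = (int16_t)(0 - s[1]);

        /* Step 4: multiply-sum gives f.re*d.im - f.im*d.re. */
        result.im = msum_lane(f, s, result.im);
    }
    return result;
}
```

In the real vector code each vec_msum handles four such lanes at once, so the four partial sums still have to be added together after the loop.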


I think this is much more efficient than first creating uniform vectors and then dealing with vec_mule/vec_mulo and a separate addition.

vec_msum fits the algorithm wonderfully, once I figured out how to swap the re/im parts and change the signs.


-- Sincerely, Rustam Muginov

On Nov 25, 2005, at 4:57 PM, Holger Bettag wrote:

On Fri, 25 Nov 2005, Rustam Muginov wrote:

Thank you :)
I really went about it in an overcomplicated way.
Subtraction from zero is the way to go.

Depending on the degree of utilization of the vector permute unit, there
might be a way to increase overall throughput. This can only work if there
is more work being done in addition to the sign change.


1. take two complex input vectors and reorder the data into one real
   vector and one imaginary vector with two permutes

2. flip the sign of the imaginary vector with subtraction

3. restore the original order of data with another two permutes
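
The three steps above can be pictured with a scalar C model. This is a sketch only: the array loops stand in for the permutes and the vector subtract, LANES and negate_imag_pair are hypothetical names, and the particular element order chosen for the real and imaginary vectors is an assumption (any order works as long as step 3 inverts step 1).

```c
#include <stdint.h>

enum { LANES = 8 };  /* eight 16-bit elements per 128-bit vector */

/* Flip the sign of the imaginary (odd-index) elements of two
 * interleaved complex vectors, v0 and v1, in place. */
static void negate_imag_pair(int16_t v0[LANES], int16_t v1[LANES])
{
    int16_t re[LANES], im[LANES];
    const int half = LANES / 2;

    /* Step 1: two "permutes" gather all real parts into one vector
     * and all imaginary parts into another. */
    for (int i = 0; i < half; ++i) {
        re[i]        = v0[2 * i];
        re[i + half] = v1[2 * i];
        im[i]        = v0[2 * i + 1];
        im[i + half] = v1[2 * i + 1];
    }

    /* Step 2: a single subtraction from zero negates every imaginary part. */
    for (int i = 0; i < LANES; ++i)
        im[i] = (int16_t)(0 - im[i]);

    /* Step 3: two more "permutes" restore the interleaved order. */
    for (int i = 0; i < half; ++i) {
        v0[2 * i]     = re[i];
        v0[2 * i + 1] = im[i];
        v1[2 * i]     = re[i + half];
        v1[2 * i + 1] = im[i + half];
    }
}
```

Counting the stand-ins gives four permutes and one subtract for two input vectors, which matches the 2.5 instructions per input vector mentioned below.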

This still takes at least two clock cycles per vector (limited by permute
now), and uses 2.5 instructions per input vector. BUT in case the permute
unit was idle, you now have managed to offload real work to it, and you
gained three issue slots for further computational instructions.


  Holger

P.S.: If you can keep imaginary and real components in separate vectors
      over the course of more computation, you might see even more
      performance improvement.




