

Re: Fastest way to change sign of the odd elements of vSInt16



Holger, thank you for your excellent (as usual) comments.
Indeed, to convert the vector:
// d0.re, d0.im, d1.re, d1.im, d2.re, d2.im, d3.re, d3.im
into the vector:
// d0.im, -d0.re, d1.im, -d1.re, d2.im, -d2.re, d3.im, -d3.re
I need only two instructions, one vec_sub and one vec_perm.
Plus, these instructions run on separate units!

This is the permute mask I am using:
  // input vector 0:
  //  0001 0203 0405 0607 0809 0A0B 0C0D 0E0F
  // |----+----+----+----+----+----+----+----|
  //   0re  0im  1re  1im  2re  2im  3re  3im

  // input vector 1 (negated):
  //  1011 1213 1415 1617 1819 1A1B 1C1D 1E1F
  // |----+----+----+----+----+----+----+----|
  //  -0re -0im -1re -1im -2re -2im -3re -3im

  //  permute mask and the result of permute
  //  0203 1011 0607 1415 0A0B 1819 0E0F 1C1D
  // |----+----+----+----+----+----+----+----|
  //   0im -0re  1im -1re  2im -2re  3im -3re
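
For readers without a PowerPC machine at hand, the mask above can be checked with a small scalar model. This is only a sketch: perm_model and swap_negate_model are hypothetical names, not AltiVec calls. It assumes the usual vec_perm behaviour, where each mask byte picks one byte out of the 32-byte concatenation of the two inputs, and it stands in for vec_sub(zero, d) with a plain loop.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of vec_perm: mask byte i selects one byte from
   the 32-byte concatenation a||b; only the low 5 bits of the selector count. */
static void perm_model(const uint8_t a[16], const uint8_t b[16],
                       const uint8_t mask[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t sel = mask[i] & 0x1F;
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* Turns { re0, im0, re1, im1, ... } into { im0, -re0, im1, -re1, ... }
   using one negation pass (the vec_sub) and one permute (the vec_perm). */
void swap_negate_model(const int16_t d[8], int16_t out[8])
{
    /* The permute mask from the post, written byte by byte. */
    static const uint8_t mask[16] = {
        0x02, 0x03, 0x10, 0x11, 0x06, 0x07, 0x14, 0x15,
        0x0A, 0x0B, 0x18, 0x19, 0x0E, 0x0F, 0x1C, 0x1D
    };
    int16_t neg[8];
    uint8_t a[16], b[16], r[16];

    /* Input vector 1: the negated data, as vec_sub(zero, d) would produce. */
    for (int i = 0; i < 8; i++)
        neg[i] = (int16_t)(0 - d[i]);

    /* The mask only moves aligned byte pairs, so copying the 16-bit
       elements in native byte order gives the same element-level result. */
    memcpy(a, d, sizeof a);
    memcpy(b, neg, sizeof b);
    perm_model(a, b, mask, r);
    memcpy(out, r, sizeof r);
}
```

Feeding in the elements 1..8 as four (re, im) pairs should give 2, -1, 4, -3, 6, -5, 8, -7, which matches the last row of the diagram.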

Thank you again for your nice ideas.
--
Sincerely, Rustam Muginov

On Nov 25, 2005, at 6:19 PM, Holger Bettag wrote:

On Fri, 25 Nov 2005, Rustam Muginov wrote:

[...]
1. Use vec_msum on the Filter and Data vectors (calculation one). This
allows me to make 8 multiplications and 4 additions with one
instruction, and I am getting a 32-bit result for free.
2. Swap real and imaginary parts in one vector with vec_rl.
3. Invert the signs of the odd elements of the swapped vector (now two
instructions, vec_sub and vec_sel).
4. Use vec_msum again (calculation two). Again, 8 muls and 4 adds with
one instruction.
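
The arithmetic of the four steps above can be sketched in portable C. This is a scalar model under stated assumptions, not the poster's code: msum_model and complex_mac_model are hypothetical names, the accumulator is taken to start at zero, and the swapped vector is assumed to be { im, -re, ... } as described later in the thread. Under those assumptions the two sums are the real and imaginary parts of conj(f) * d for each complex pair.

```c
#include <stdint.h>

/* Hypothetical scalar model of vec_msum on 16-bit inputs: each of the four
   32-bit lanes gets two adjacent 16-bit products plus the accumulator lane.
   The additions go through uint32_t so they wrap instead of overflowing. */
void msum_model(const int16_t a[8], const int16_t b[8],
                const int32_t acc[4], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        uint32_t p0 = (uint32_t)((int32_t)a[2 * i]     * b[2 * i]);
        uint32_t p1 = (uint32_t)((int32_t)a[2 * i + 1] * b[2 * i + 1]);
        out[i] = (int32_t)((uint32_t)acc[i] + p0 + p1);
    }
}

/* Steps 1 to 4 for four complex values stored as { re, im } pairs. */
void complex_mac_model(const int16_t f[8], const int16_t d[8],
                       int32_t re[4], int32_t im[4])
{
    const int32_t zero[4] = {0, 0, 0, 0};
    int16_t t[8];

    /* Step 1: re = f.re*d.re + f.im*d.im (calculation one). */
    msum_model(f, d, zero, re);

    /* Steps 2 and 3: swap the halves and negate the odd elements,
       giving { d.im, -d.re } for every pair. */
    for (int i = 0; i < 4; i++) {
        t[2 * i]     = d[2 * i + 1];
        t[2 * i + 1] = (int16_t)(0 - d[2 * i]);
    }

    /* Step 4: im = f.re*d.im - f.im*d.re (calculation two). */
    msum_model(f, t, zero, im);
}
```

For example, with f = 1 + 2i and d = 5 + 6i in the first pair, the model gives re = 17 and im = -4, which agrees with (1 - 2i)(5 + 6i) = 17 - 4i.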

Consider replacing step 2 and the vec_sel of step 3 with equivalent
permutes. If the loop is unrolled to exhibit enough instruction level
parallelism, the permutes can execute in parallel with the computational
instructions. This could increase throughput notably. However, especially
on G5, this will increase latencies, so unrolling is required for it to
be beneficial.
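
The shape of such an unrolled loop can be sketched in portable C. Everything here is illustrative: msum4, swap_negate4 and cmac_unrolled2 are hypothetical names, the helpers are scalar stand-ins for vec_msum and the permute, and the lane-by-lane accumulation across vectors is an assumption that the thread does not spell out. The point is only the structure: blocks A and B use separate temporaries and accumulators, so the permute work of one block does not wait on the multiply-sum work of the other.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for vec_msum: adds two adjacent 16-bit products per lane into
   acc. No wraparound handling; inputs are assumed small enough. */
static void msum4(const int16_t *a, const int16_t *b, int32_t acc[4])
{
    for (int i = 0; i < 4; i++)
        acc[i] += (int32_t)a[2 * i] * b[2 * i]
                + (int32_t)a[2 * i + 1] * b[2 * i + 1];
}

/* Stand-in for the permute: { re, im } pairs become { im, -re } pairs. */
static void swap_negate4(const int16_t *d, int16_t t[8])
{
    for (int i = 0; i < 4; i++) {
        t[2 * i]     = d[2 * i + 1];
        t[2 * i + 1] = (int16_t)(0 - d[2 * i]);
    }
}

/* Unrolled by two. nvec counts 8-element vectors and is assumed even;
   a trailing odd vector would be ignored by this sketch. */
void cmac_unrolled2(const int16_t *f, const int16_t *d, size_t nvec,
                    int32_t re[4], int32_t im[4])
{
    int32_t reA[4] = {0}, imA[4] = {0}, reB[4] = {0}, imB[4] = {0};
    int16_t tA[8], tB[8];

    for (size_t v = 0; v + 1 < nvec; v += 2) {
        const int16_t *fA = f + 8 * v, *dA = d + 8 * v;
        const int16_t *fB = fA + 8,    *dB = dA + 8;

        swap_negate4(dA, tA);   /* permute-unit work, block A */
        swap_negate4(dB, tB);   /* permute-unit work, block B */
        msum4(fA, dA, reA);     /* computational work, block A */
        msum4(fB, dB, reB);     /* computational work, block B */
        msum4(fA, tA, imA);
        msum4(fB, tB, imB);
    }

    /* Combine the two independent accumulator sets at the end. */
    for (int i = 0; i < 4; i++) {
        re[i] = reA[i] + reB[i];
        im[i] = imA[i] + imB[i];
    }
}
```

Whether a real AltiVec build actually overlaps the permutes with the multiply-sums depends on the compiler's scheduling and on the target CPU, so this would need to be measured rather than assumed.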


  Holger

P.S.: The general rule here is to try and balance the work between permute
unit and the computational units. Ideally, the machine will approach
peak throughput of one permute plus one other vector instruction per
cycle.
_______________________________________________
Do not post admin requests to the list. They will be ignored.
PerfOptimization-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/perfoptimization-dev/ email@hidden


This email sent to email@hidden


References: 
 >Fastest way to change sign of the odd elements of vSInt16 (From: Rustam Muginov <email@hidden>)
 >Re: Fastest way to change sign of the odd elements of vSInt16 (From: Paul Russell <email@hidden>)
 >Re: Fastest way to change sign of the odd elements of vSInt16 (From: Rustam Muginov <email@hidden>)
 >Re: Fastest way to change sign of the odd elements of vSInt16 (From: Holger Bettag <email@hidden>)
 >Re: Fastest way to change sign of the odd elements of vSInt16 (From: Rustam Muginov <email@hidden>)
 >Re: Fastest way to change sign of the odd elements of vSInt16 (From: Holger Bettag <email@hidden>)




Copyright © 2007 Apple Inc. All rights reserved.