On Fri, 25 Nov 2005, Rustam Muginov wrote:
[...]
> 1. Use vec_msum onto Filter and Data vector (calculation one). This
> allows me to make 8 multiplications and 4 additions with one
> instruction, and I am getting a 32-bit result for free.
> 2. Swap real and imaginary parts in one vector with vec_rl.
> 3. Invert signs on the odd elements of the swapped vector (now two
> instructions, vec_sub and vec_sel).
> 4. Use vec_msum again (calculation two). Again, 8 muls and 4 adds with
> one instruction.
>
Consider replacing step 2 and the vec_sel of step 3 with equivalent
permutes. If the loop is unrolled far enough to expose sufficient
instruction-level parallelism, the permutes can execute in parallel with
the computational instructions, which could increase throughput notably.
However, this will increase latencies, especially on G5, so unrolling is
required for the change to pay off.
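
As a rough illustration of the data movement only, here is a portable
scalar C model (not AltiVec code) of steps 2 and 3 next to a permute-based
variant. Assumptions on my part: the vectors hold eight signed 16-bit
elements, the vec_rl in step 2 rotates each 32-bit word by 16 bits, and the
swap and the select are folded into a single two-source permute (the
suggestion above only says "equivalent permutes"). All function names are
hypothetical, and perm16 models vec_perm at halfword granularity, whereas
the real instruction indexes bytes.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* eight 16-bit elements per 128-bit vector */

/* Modular 16-bit negation, modelling vec_sub(zero, x). */
int16_t neg16(int16_t x)
{
    return (int16_t)(uint16_t)(0u - (uint16_t)x);
}

/* Steps 2 and 3 as quoted: swap the halfwords of each 32-bit word
   (vec_rl by 16), then negate the odd elements (vec_sub plus vec_sel). */
void swap_negate_rotate_select(const int16_t in[LANES], int16_t out[LANES])
{
    int16_t swapped[LANES];
    for (int k = 0; k < LANES; k += 2) {
        swapped[k]     = in[k + 1];
        swapped[k + 1] = in[k];
    }
    for (int i = 0; i < LANES; i++)
        out[i] = (i & 1) ? neg16(swapped[i]) : swapped[i];
}

/* Halfword-granular model of a two-source permute: indices 0..7 pick
   from a, indices 8..15 pick from b. */
void perm16(const int16_t a[LANES], const int16_t b[LANES],
            const uint8_t idx[LANES], int16_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        assert(idx[i] < 2 * LANES);
        out[i] = (idx[i] < LANES) ? a[idx[i]] : b[idx[i] - LANES];
    }
}

/* Permute variant: negate the whole vector once, then let one permute
   do both the swap and the selection of the negated elements. */
void swap_negate_permute(const int16_t in[LANES], int16_t out[LANES])
{
    static const uint8_t pattern[LANES] = {1, 8, 3, 10, 5, 12, 7, 14};
    int16_t neg[LANES];
    for (int i = 0; i < LANES; i++)
        neg[i] = neg16(in[i]);
    perm16(in, neg, pattern, out);
}

/* Returns 1 if both variants produce the same vector for this input. */
int variants_agree(const int16_t in[LANES])
{
    int16_t a[LANES], b[LANES];
    swap_negate_rotate_select(in, a);
    swap_negate_permute(in, b);
    for (int i = 0; i < LANES; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

In this model the subtract remains on the arithmetic side while the rotate
and the select collapse into work for the permute unit. Whether that
actually helps on a given machine depends on the unrolling discussed above.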
Holger
P.S.: The general rule here is to try to balance the work between the
permute unit and the computational units. Ideally, the machine will
approach its peak throughput of one permute plus one other vector
instruction per cycle.