Thank you for the good catch, Holger.
I have implemented this approach.
I created four partial sums and zeroed them before the loop.
After the loop, I add the partial sums in the following way:
vSum = vec_add( vec_add( vSum0, vSum1 ), vec_add( vSum2, vSum3 ) );
But now the modified function gives different results compared with the
original one.
Sometimes the relative error is rather high, around 11%.
I think I did something wrong when switching to partial sums, but I
cannot figure out what.
This way, the operations become independent. You add the partial sums
only after the loop, so you suffer from the stalls only once, rather than
every iteration.
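
For comparison, here is a minimal scalar sketch of the partial-sum idea in plain C. It is only a stand-in for the AltiVec version: the function names sum_single and sum_partial4 are hypothetical, and each float accumulator plays the role of one of vSum0..vSum3. Reordering floating-point additions does change the rounding a little, but a difference as large as 11% more likely comes from something else, for example an accumulator that is not zeroed before the loop, an element that is skipped or added twice, or leftover elements at the end that are not handled.

```c
#include <stddef.h>

/* Reference version: one accumulator, so every add waits on the previous one. */
float sum_single(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Sketch: four independent accumulators, combined once after the loop. */
float sum_partial4(const float *a, size_t n)
{
    /* All four partial sums must start at zero before the loop. */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: step by 4 and feed each element to exactly one sum. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Tail: elements left over when n is not a multiple of 4. */
    for (; i < n; ++i)
        s0 += a[i];

    /* Same combining shape as vec_add(vec_add(..), vec_add(..)). */
    return (s0 + s1) + (s2 + s3);
}
```

Running both functions on the same input and comparing the results is a quick way to see whether the difference is only rounding noise or a real indexing problem.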