I'm not sure how to do that sum at the end nicely though. Maybe
somebody else has a suggestion.
I'm no longer sure where I got this, but I think it was in a piece of
assembly from the Apple OpenGL team. I am not sure it qualifies as
"nicely"; see the comments at the end.
So, assume diS contains the 4 values you want to add. At the end of
those instructions:
the net result is a vector where all 4 components are the sum of the 4
original components. Variants of the same scheme can be used to sum
more dimensions, or fewer; you mostly have to get used to vsldoi.
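
For readers who want to see the shape of such a sequence, here is a
minimal sketch in portable C. It is not the original listing: the type
vec4 and the helpers sld_words, add4 and sum_across are hypothetical
names that only emulate what vsldoi (vec_sld) and vaddfp (vec_add) do
on a 4 x float register, so the data movement can be followed without
AltiVec hardware.

```c
#include <assert.h>

/* Hypothetical stand-in for a 4 x float AltiVec register. */
typedef struct { float f[4]; } vec4;

/* Emulates vsldoi / vec_sld(a, b, 4 * words): concatenate a and b,
   shift left by `words` 32-bit elements, keep the leftmost four. */
static vec4 sld_words(vec4 a, vec4 b, unsigned words)
{
    vec4 r;
    assert(words < 4);
    for (unsigned i = 0; i < 4; ++i) {
        unsigned j = i + words;
        r.f[i] = (j < 4) ? a.f[j] : b.f[j - 4];
    }
    return r;
}

/* Emulates vaddfp / vec_add: element-wise addition. */
static vec4 add4(vec4 a, vec4 b)
{
    vec4 r;
    for (unsigned i = 0; i < 4; ++i)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* Sum across the register: two rotate-and-add steps. */
static vec4 sum_across(vec4 diS)
{
    /* (a, b, c, d) + (c, d, a, b) = (a+c, b+d, c+a, d+b) */
    vec4 t = add4(diS, sld_words(diS, diS, 2));
    /* rotate by one element and add: every lane holds a+b+c+d */
    return add4(t, sld_words(t, t, 1));
}
```

Shifting a vector against itself acts as a rotation, which is why two
shift/add pairs are enough for 4 elements. Note that the second pair
cannot start before the first one finishes.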
However, the problem with this is again that each instruction depends
directly on the result of the previous one. So it's best done in a
larger block of code, interleaved with other independent calculations.
If you have a larger set of dimensions to calculate and add, you can
postpone the above "mixing" until the end, by simply adding the
intermediate vectors together into an accumulator, then adding the
subelements of that accumulator only when done.
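
A rough sketch of that accumulator idea, again in portable C rather
than real AltiVec code. The names vec4, add4, rotate_words and
sum_array are hypothetical, and the sketch assumes the element count
is a multiple of 4.

```c
#include <assert.h>

/* Hypothetical stand-in for a 4 x float AltiVec register. */
typedef struct { float f[4]; } vec4;

/* Emulates vaddfp / vec_add: element-wise addition. */
static vec4 add4(vec4 a, vec4 b)
{
    vec4 r;
    for (unsigned i = 0; i < 4; ++i)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* Emulates vsldoi of a vector with itself: rotate left by `words`
   32-bit elements. */
static vec4 rotate_words(vec4 a, unsigned words)
{
    vec4 r;
    for (unsigned i = 0; i < 4; ++i)
        r.f[i] = a.f[(i + words) % 4];
    return r;
}

/* Sums n floats; n is assumed to be a multiple of 4. */
static float sum_array(const float *x, unsigned n)
{
    vec4 acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
    assert(n % 4 == 0);

    /* Main loop: one plain vector add per 4 inputs, no shuffling. */
    for (unsigned i = 0; i < n; i += 4) {
        vec4 v = {{x[i], x[i + 1], x[i + 2], x[i + 3]}};
        acc = add4(acc, v);
    }

    /* The "mixing" happens once, after the loop. */
    vec4 t = add4(acc, rotate_words(acc, 2));
    t = add4(t, rotate_words(t, 1));
    return t.f[0];
}
```

The dependent shift/add chain is paid only once per call instead of
once per vector, so its latency matters much less.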