SSE - How to multiply a vector, then sum up its results

I have another problem that was very easy and straightforward to solve
in Altivec:

I have two vectors, each holding a row of 8 bit values. I need to
multiply them element by element (such as with a vector mult
instruction), then add up all the resulting values (vector sum),
divide the sum by 256, write the result to memory, and repeat.

Basically, I need to write a simple algorithm to scale a number of
bytes down to a smaller range, like when you'd reduce the size of a
grayscale picture made up of 8 bit values per pixel.
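
For reference, the operation described above can be sketched in plain C as follows. This is only an illustration: the function name, the parameter names and the assumption that the weights are 16 bit values adding up to 256 are not taken from my actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Weighted sum of `count` source bytes. The weights are assumed to be
   fixed point values that add up to 256, so shifting the sum right by 8
   brings it back into the 8 bit range. Adding 128 first rounds to nearest. */
uint8_t scale_pixel_scalar(const uint8_t *src, const uint16_t *weights, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += (uint32_t)src[i] * weights[i];
    return (uint8_t)((sum + 128) >> 8);
}
```

With eight equal weights of 32 each, this simply yields the average of the eight input bytes.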

Since this is quite a common application (downscaling) I am surprised
there are no well-suited SSE instructions for this, or am I blind?

All I found is the mul-add function (_mm_madd_epi16), which performs
the vector multiplication as needed but then adds only every two
adjacent products, not all of them. If I have 8 input bytes, I only get
down to 4 partial sums. I then still need a few shifts and adds to get
the rest summed up. And the result is that this is _slower_ than doing
it in a loop with non-SSE instructions.
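
To make the pairwise behaviour concrete, here is a small sketch that runs _mm_madd_epi16 on eight 16 bit values and stores the four 32 bit results. The helper name madd_pairs and its array parameters are made up for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Multiplies eight 16 bit values by eight 16 bit weights and writes the
   four pairwise sums a[0]*b[0]+a[1]*b[1], a[2]*b[2]+a[3]*b[3], ...
   to out[0..3]. The four results are NOT summed any further. */
void madd_pairs(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(va, vb));
}
```

For the inputs 1..8 with all weights set to 1, the output is 3, 7, 11, 15, so the four partial sums still have to be combined in separate steps.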

Tests showed that if I could get rid of all the additional shifts and
adds, I'd be much faster, so it's not a memory-throughput issue but
rather one of too many instructions.

Here's the basic algo I've got so far:

	// loads 16 bytes, of which only the first 8 are used below
	v = _mm_loadu_si128 ((const __m128i *) srcPtr);
	v = _mm_unpacklo_epi8 (v, zero);	// expand first 8 bytes into 16 bit words

	// multiply the words by their weights, and add up every two adjacent values
	v = _mm_madd_epi16 (v, weights[x++]);

	// sum up the remaining 4 vector fields (32 bit each) into the lowest field
	v = _mm_add_epi64 (v, _mm_srli_si128 (v, 4));
	v = _mm_add_epi64 (v, _mm_srli_si128 (v, 8));

	// get the sum (it fits into 16 bits) and store it, rounded
	UInt16 sum = _mm_extract_epi16 (v, 0);
	*dstPtr++ = (sum + 128) >> 8;

Above you see that the code loads a few bytes from memory, turns them
into 16 bit values, multiplies them by another vector which contains
the weight of each loaded byte (the weights raise the sum by a factor
of 256), then sums them all up and stores the result.
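
For anyone who wants to try this, the steps just described can be wrapped into a self-contained function roughly like the one below. It is a sketch rather than my exact code: the function name and signature are invented, it loads exactly 8 bytes with _mm_loadl_epi64, and it uses 32 bit adds for the horizontal sum. The weights are assumed to be eight 16 bit values adding up to 256.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Computes one destination byte from 8 source bytes and 8 weights. */
uint8_t scale_pixel_sse2(const uint8_t *src, __m128i weights)
{
    const __m128i zero = _mm_setzero_si128();

    /* load exactly 8 bytes and widen them to 16 bit words */
    __m128i v = _mm_loadl_epi64((const __m128i *)src);
    v = _mm_unpacklo_epi8(v, zero);

    /* multiply by the weights; adjacent products are added pairwise,
       leaving four 32 bit partial sums */
    v = _mm_madd_epi16(v, weights);

    /* horizontal sum of the four 32 bit fields into the lowest field */
    v = _mm_add_epi32(v, _mm_srli_si128(v, 8));
    v = _mm_add_epi32(v, _mm_srli_si128(v, 4));

    /* extract the total, round, and scale back down by 256 */
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(v);
    return (uint8_t)((sum + 128) >> 8);
}
```

Called with the weights set via _mm_set1_epi16(32), it returns the average of the 8 bytes at src. The two shift-and-add steps in the middle are the part I would like to get rid of.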

Any ideas how to make this faster in SSE2? Or in SSE3?

Thomas Tempelmann,
PerfOptimization-dev mailing list      (email@hidden)