SSE - How to multiply a vector, then sum up its results

I have another problem that was very easy and straightforward to solve
in Altivec:

I have two vectors, each holding a row of 8-bit values, which I need to
multiply with each other (as with a vector multiply instruction),
then add up all the resulting values (vector sum), and finally divide
the result by 256 and write it to memory, then repeat.

Basically, I need to write a simple algorithm to scale a number of
bytes down to a smaller range, like when you'd reduce the size of a
grayscale picture made up of 8 bit values per pixel.
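
To pin down the operation described above, here is a minimal scalar sketch in C. The function name weighted_sum8_scalar is hypothetical, and it assumes 8 input bytes per output and non-negative fixed-point weights that sum to 256; none of these details are spelled out beyond what the post says.

```c
#include <stdint.h>

/* Hypothetical scalar reference: one output byte from 8 input bytes.
   Assumes the weights are non-negative and sum to 256 (8.8 fixed point). */
static uint8_t weighted_sum8_scalar(const uint8_t *src, const int16_t *weights)
{
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (uint32_t)src[i] * (uint32_t)weights[i];  /* multiply, accumulate */
    return (uint8_t)((sum + 128) >> 8);                  /* divide by 256, rounded */
}
```

With eight equal weights of 32 the function reduces to a rounded average of the 8 bytes.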

Since this is quite a common application (downscaling) I am surprised
there are no well-suited SSE instructions for this, or am I blind?

All I found is the mul-add function that performs a vector
multiplication as needed, but then adds only two adjacent values, not
all. If I have 8 input bytes, I'd only get down to a mul-sum of 4
results. I'd then still need a few shifts and adds to get the rest
summed up. And the result is that this is _slower_ than doing it in a
loop with non-SSE instructions.
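
To illustrate the pairwise behaviour described above, the following small sketch stores what _mm_madd_epi16 produces for two vectors of eight 16-bit values. The helper name madd_pairs is hypothetical; the intrinsics are the standard SSE2 ones from emmintrin.h.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: stores the four 32-bit pair sums that
   _mm_madd_epi16 produces, i.e.
   out[i] = a[2*i] * b[2*i] + a[2*i + 1] * b[2*i + 1]  for i = 0..3. */
static void madd_pairs(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(va, vb));
}
```

Eight inputs therefore leave four partial sums, which still have to be added together in separate steps.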

Tests showed that if I could get rid of all the additional shifts and
adds, I'd be much faster, so it's not a memory-throughput issue but rather
one of too many instructions.

Here's the basic algo I've got so far:

	// load 16 bytes (unaligned); only the low 8 input bytes are used
	v = _mm_loadu_si128 ((const __m128i *)srcPtr);
	v = _mm_unpacklo_epi8 (v, zero);	// expand first 8 bytes into 16 bit words

	// multiply the words by their weights, and add up every two adjacent values
	v = _mm_madd_epi16(v, weights[x++]);

	// sum up the remaining 4 vector fields
	v = _mm_add_epi64(v, _mm_srli_si128(v,4));
	v = _mm_add_epi64(v, _mm_srli_si128(v,8));

	// get the sum (low 16 bits) and store it, rounded
	UInt16 sum = _mm_extract_epi16 (v, 0);
	*dstPtr++ = (sum+128)>>8;

Above you see that the code loads a few bytes from memory, turns
them into 16-bit values, multiplies them by another vector that contains
the weight of each loaded byte (the weights are scaled by a factor of 256),
then sums everything up and stores the result.
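
For reference, the steps just described can be put into one self-contained function. This is only a sketch under assumptions: the name weighted_sum8_sse2 is hypothetical, the weights are taken as eight non-negative 16-bit values summing to 256, and _mm_loadl_epi64 is swapped in so that exactly 8 bytes are read. It is not claimed to be faster than the code above; it still needs two shift-and-add steps for the horizontal sum.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical wrapper: weighted sum of 8 bytes, divided by 256 with rounding.
   Assumes non-negative weights that sum to 256. */
static uint8_t weighted_sum8_sse2(const uint8_t *src, const int16_t *weights)
{
    const __m128i zero = _mm_setzero_si128();

    __m128i v = _mm_loadl_epi64((const __m128i *)src);      /* read exactly 8 bytes */
    v = _mm_unpacklo_epi8(v, zero);                         /* widen to 8 x 16 bit */

    __m128i w = _mm_loadu_si128((const __m128i *)weights);
    v = _mm_madd_epi16(v, w);                               /* 4 x 32-bit pair sums */

    v = _mm_add_epi32(v, _mm_srli_si128(v, 8));             /* lanes 0,1 += lanes 2,3 */
    v = _mm_add_epi32(v, _mm_srli_si128(v, 4));             /* lane 0 += lane 1 */

    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(v);          /* total is in lane 0 */
    return (uint8_t)((sum + 128) >> 8);                     /* divide by 256, rounded */
}
```

Calling it once per output byte, with a different weight vector each time, mirrors the loop body shown in the post.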

Any ideas how to make this faster in SSE2? Or in SSE3?

--
Thomas Tempelmann, http://www.tempel.org/