
I have another problem that was very easy and straightforward to solve in AltiVec: I have two vectors, each holding a row of 8-bit values, which I need to multiply with each other (as with a vector multiply instruction), then add up all the resulting values (vector sum), divide the result by 256, and write it to memory. Then repeat.

Basically, I need to write a simple algorithm to scale a number of bytes down to a smaller range, as when you reduce the size of a grayscale picture made up of 8-bit values per pixel. Since downscaling is quite a common application, I am surprised there are no well-suited SSE instructions for this, or am I blind?

All I found is the muladd function (_mm_madd_epi16), which performs a vector multiplication as needed, but then adds only two adjacent values, not all of them. If I have 8 input bytes, I only get down to 4 partial sums. I then still need a few shifts and adds to get the rest summed up. The result is that this is _slower_ than doing it in a loop with non-SSE instructions. Tests showed that if I could get rid of all the additional shifts and adds, I'd be much faster, so it's not a memory-throughput issue but rather one of too many instructions.

Here's the basic algorithm I've got so far:

    v = _mm_loadu_si128((const __m128i *)srcPtr);  // loads 16 bytes; only the first 8 are used
    v = _mm_unpacklo_epi8(v, zero);  // expand first 8 bytes into 16-bit words
    // multiply the words by their weights, and add up every two adjacent values
    v = _mm_madd_epi16(v, weights[x++]);
    // sum up the remaining 4 vector fields
    v = _mm_add_epi64(v, _mm_srli_si128(v, 4));
    v = _mm_add_epi64(v, _mm_srli_si128(v, 8));
    // get the sum and store it, rounded
    UInt16 sum = _mm_extract_epi16(v, 0);
    *dstPtr++ = (sum + 128) >> 8;

As you can see above, the code loads a few bytes from memory, widens them to 16-bit values, multiplies them by another vector that contains the weight of each loaded byte (the weights raise the sum by a factor of 256), then sums everything up and stores the result.
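For reference, the computation described above can be pinned down as a small self-contained sketch. The names scale8_scalar, scale8_sse2 and scale8 are hypothetical, and the weights are assumed to be eight signed 16-bit values that sum to 256. The SSE2 path is a variant of the snippet above, not a replacement for it: it assumes _mm_loadl_epi64 to read exactly 8 bytes, and _mm_add_epi32 with _mm_cvtsi128_si32 for the final reduction. It does not reduce the number of shifts and adds; it only gives a scalar baseline to check results against.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar baseline: weighted sum of 8 bytes, rounded and divided by 256.
   Assumes the 8 weights sum to 256, so the result fits in a byte. */
static uint8_t scale8_scalar(const uint8_t *src, const int16_t *w)
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += src[i] * w[i];
    return (uint8_t)((sum + 128) >> 8);
}

#if defined(__SSE2__)
/* SSE2 variant of the same computation (a sketch, not a tuned routine). */
static uint8_t scale8_sse2(const uint8_t *src, const int16_t *w)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i v = _mm_loadl_epi64((const __m128i *)src);  /* load exactly 8 bytes */
    v = _mm_unpacklo_epi8(v, zero);                     /* widen to 8 x 16 bit */
    /* multiply by the weights and add adjacent pairs: 4 x 32-bit partial sums */
    v = _mm_madd_epi16(v, _mm_loadu_si128((const __m128i *)w));
    v = _mm_add_epi32(v, _mm_srli_si128(v, 8));         /* lanes 0,1 += lanes 2,3 */
    v = _mm_add_epi32(v, _mm_srli_si128(v, 4));         /* lane 0 += lane 1 */
    return (uint8_t)((_mm_cvtsi128_si32(v) + 128) >> 8);
}
#endif

/* Uses the SSE2 path when the compiler targets SSE2, else the scalar loop. */
static uint8_t scale8(const uint8_t *src, const int16_t *w)
{
#if defined(__SSE2__)
    return scale8_sse2(src, w);
#else
    return scale8_scalar(src, w);
#endif
}
```

With eight equal weights of 32, scale8 returns the rounded mean of the eight input bytes, which makes it easy to compare any faster variant against the scalar loop.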
Any ideas how to make this faster in SSE2? Or in SSE3?

Thomas Tempelmann, http://www.tempel.org/
_______________________________________________
PerfOptimization-dev mailing list (email@hidden)