I have another problem that was very easy and straightforward to solve in Altivec:

I have two vectors, each holding a row of 8 bit values, which I need to multiply with each other (such as with a vector mult instruction), then add up all the resulting values (vector sum), and finally divide the result by 256 and write it to memory. Then repeat.

Basically, I need to write a simple algorithm to scale a number of bytes down to a smaller range, like when you'd reduce the size of a grayscale picture made up of 8 bit values per pixel.

Since this is quite a common application (downscaling), I am surprised there are no well-suited SSE instructions for this, or am I blind?

All I found is the mul-add function (_mm_madd_epi16), which performs a vector multiplication as needed, but then adds only two adjacent values, not all of them. If I have 8 input bytes, I only get down to 4 partial sums. I then still need a few shifts and adds to get the rest summed up.

And the result is that this is _slower_ than doing it in a loop with non-SSE instructions. Tests showed that if I could get rid of all the additional shifts and adds, I'd be much faster, so it's not a memory-throughput issue but rather one of too many instructions.

Here's the basic algo I've got so far:

    v = _mm_loadu_si128 ((const __m128i *)srcPtr); // load the input bytes (up to 8 are used)
    v = _mm_unpacklo_epi8 (v, zero);  // expand first 8 bytes into 16 bit words
    // multiply the words by their weights, and add up every two adjacent values
    v = _mm_madd_epi16 (v, weights[x++]);
    // sum up the remaining 4 vector fields
    v = _mm_add_epi64 (v, _mm_srli_si128 (v, 4));
    v = _mm_add_epi64 (v, _mm_srli_si128 (v, 8));
    // get the sum (low 16 bits) and store it, rounded, as UInt16
    int sum = _mm_extract_epi16 (v, 0);
    *dstPtr++ = (sum + 128) >> 8;

Above you see that the code loads a few bytes from memory, then turns them into 16 bit values, then multiplies them by another vector which contains the weight of each loaded byte (the weights raise the sum by a factor of 256), then sums them all up and stores the result.
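To make the arithmetic concrete, here is a minimal, self-contained sketch of the same computation for a single output value, next to a plain scalar reference. The function names scale8_scalar and scale8_sse2 are made up for illustration, and the sketch assumes non-negative 16 bit weights that add up to 256 and exactly 8 input bytes per output. It also swaps in _mm_loadl_epi64, _mm_add_epi32 and _mm_cvtsi128_si32 for the load, the adds and the extract used above; that is a variation, not a claim that it is faster.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: weighted sum of 8 bytes.
   Assumption: the 8 weights are non-negative and add up to 256. */
uint16_t scale8_scalar(const uint8_t *src, const int16_t *w)
{
    assert(src != NULL && w != NULL);
    int32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int32_t)src[i] * (int32_t)w[i];
    return (uint16_t)((sum + 128) >> 8);   /* round, then divide by 256 */
}

/* SSE2 version of the same computation (hypothetical helper). */
uint16_t scale8_sse2(const uint8_t *src, const int16_t *w)
{
    const __m128i zero = _mm_setzero_si128();

    /* load exactly 8 bytes into the low half of the register */
    __m128i v = _mm_loadl_epi64((const __m128i *)src);
    /* widen the 8 bytes to eight 16 bit words */
    v = _mm_unpacklo_epi8(v, zero);
    /* multiply by the weights; adjacent pairs are added,
       leaving four 32 bit partial sums s0..s3 */
    v = _mm_madd_epi16(v, _mm_loadu_si128((const __m128i *)w));
    /* lane 0 becomes s0+s2, lane 1 becomes s1+s3 */
    v = _mm_add_epi32(v, _mm_srli_si128(v, 8));
    /* lane 0 becomes s0+s1+s2+s3 */
    v = _mm_add_epi32(v, _mm_srli_si128(v, 4));

    int sum = _mm_cvtsi128_si32(v);        /* read lane 0 */
    return (uint16_t)((sum + 128) >> 8);
}
```

For example, with eight input bytes of 200 and eight weights of 32 (8 * 32 = 256), the sum is 200 * 256 = 51200, and (51200 + 128) >> 8 gives back 200.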
Any ideas how to make this faster in SSE2? Or in SSE3?

-- 
Thomas Tempelmann, http://www.tempel.org/