Hello all.
I am using the LoadUnaligned function described here:
http://developer.apple.com/hardware/ve/alignment.html
The function is pretty simple:
static vector unsigned char LoadUnaligned( unsigned char *target )
{
    vector unsigned char MSQ, LSQ;
    vector unsigned char mask;
    MSQ = vec_ld(0, target);          // most significant quadword
    LSQ = vec_ld(15, target);         // least significant quadword
    mask = vec_lvsl(0, target);       // create the permute mask
    return vec_perm(MSQ, LSQ, mask);  // align the data
}
When iterating through the data in 128-bit chunks, the third
instruction in the function (vec_lvsl) looks loop-invariant to me. It
would always create the same permute mask when the data is loaded at
offsets of 16*n bytes from the same starting pointer.
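
To make the invariance claim concrete, here is a minimal scalar sketch of the mask. It assumes the documented behaviour of vec_lvsl, where byte i of the result is (address & 15) + i. The names model_lvsl and mask_is_invariant are hypothetical helpers for illustration only; they are not AltiVec calls.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the mask returned by vec_lvsl(offset, ptr).
   Assumption: byte i of the mask is ((ptr + offset) & 15) + i.
   model_lvsl is a hypothetical helper, not part of AltiVec. */
static void model_lvsl(long offset, const unsigned char *ptr,
                       unsigned char mask[16])
{
    unsigned sh = (unsigned)(((uintptr_t)ptr + (uintptr_t)offset) & 15u);
    int i;
    for (i = 0; i < 16; ++i)
        mask[i] = (unsigned char)(sh + (unsigned)i);
}

/* Returns 1 if the modelled mask is identical at
   base, base + 16, ..., base + 16 * (n - 1), and 0 otherwise. */
static int mask_is_invariant(const unsigned char *base, int n)
{
    unsigned char first[16], cur[16];
    int k;
    assert(base != NULL && n >= 1);
    model_lvsl(0, base, first);
    for (k = 1; k < n; ++k) {
        model_lvsl(0, base + 16 * k, cur);
        if (memcmp(first, cur, 16) != 0)
            return 0;
    }
    return 1;
}
```

Since adding a multiple of 16 never changes the low four bits of the address, mask_is_invariant should return 1 for any starting pointer.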
Should I manually hoist it out of the loop, i.e. create the
permute mask
mask = vec_lvsl(0, target);
before the loop starts, and execute only three instructions inside the loop:
MSQ = vec_ld(0, target);
LSQ = vec_ld(15, target);
result = vec_perm(MSQ, LSQ, mask);
Or is it enough to declare this function inline, so that the compiler
moves the invariant out of the loop by itself?
I am interested in the behaviour of both gcc 3.3 and 4.0.
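
For reference, the manually hoisted variant could look like the sketch below. It is only an illustration under stated assumptions: copy_unaligned_blocks is a hypothetical name, dst is assumed to be 16-byte aligned, and the source buffer is assumed to be readable up to the end of the last aligned quadword touched by vec_ld(15, ...). The altivec.h include is what recent gcc expects with -maltivec; Apple's gcc with -faltivec may not need it. The memcpy branch exists only so that the sketch compiles on machines without AltiVec.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Hypothetical helper: copies n_vec 16-byte blocks from a possibly
   unaligned src to a 16-byte aligned dst. */
static void copy_unaligned_blocks(const unsigned char *src,
                                  unsigned char *dst, size_t n_vec)
{
#if defined(__ALTIVEC__)
    vector unsigned char MSQ, LSQ, mask;
    size_t i;
    assert(src != NULL && dst != NULL);
    mask = vec_lvsl(0, src);               /* computed once, before the loop */
    for (i = 0; i < n_vec; ++i) {
        MSQ = vec_ld(0, src + 16 * i);     /* most significant quadword */
        LSQ = vec_ld(15, src + 16 * i);    /* least significant quadword */
        vec_st(vec_perm(MSQ, LSQ, mask), 0, dst + 16 * i);
    }
#else
    /* Portable fallback, not AltiVec: same result via a plain byte copy. */
    assert(src != NULL && dst != NULL);
    memcpy(dst, src, 16 * n_vec);
#endif
}
```

Whether the inline version of LoadUnaligned compiles to the same loop is exactly what I would like to know; comparing the assembly from gcc -S for both variants would show it.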