When iterating through the data in 128-bit chunks, the third
instruction in the function looks like an invariant to me. It would
always create the same permute mask when loading at 16 bytes * n
offsets.
Yep, as long as you're always incrementing the unaligned address
location by 16 * n bytes.
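The reason this holds is that the mask produced by vec_lvsl depends only on the low four bits of the address, and adding a multiple of 16 never changes those bits. A tiny portable C sketch of that arithmetic (lvsl_shift and shift_is_invariant are hypothetical helper names, not AltiVec calls):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the byte shift that vec_lvsl encodes in its mask,
   i.e. the low four bits of the address. */
static unsigned lvsl_shift(uintptr_t addr)
{
    return (unsigned)(addr & 15);
}

/* Returns 1 if the shift is the same at addr, addr + 16, ...,
   addr + 16 * (count - 1), and 0 otherwise. */
static int shift_is_invariant(uintptr_t addr, int count)
{
    assert(count >= 0);
    for (int n = 0; n < count; ++n) {
        if (lvsl_shift(addr + 16u * (uintptr_t)n) != lvsl_shift(addr))
            return 0;
    }
    return 1;
}
```

So a mask computed once before the loop stays valid for every iteration, provided the pointer only ever advances in 16-byte steps.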
Should I manually "unfold" this function in the loop, i.e. create
the permute mask
mask = vec_lvsl(0, target);
before the loop starts, and do only three instructions inside the loop:
If you're working in a loop then you can get it down to one load and
one permutation per loop iteration. For example:
void doWork(float *data, int dataLen)
{
    vector float u0 = vec_ld(0, data);
    vector unsigned char loadPerm = vec_lvsl(0, data);
    int end;
    int i;

    end = dataLen - (dataLen & 3);
    for (i = 0; i < end; i += 4) {
        float *p = data + i;
        vector float u1;
        vector float v;

        // each unaligned vector is constructed from two aligned vectors
        u1 = vec_ld(16, p);
        v = vec_perm(u0, u1, loadPerm);

        // update previous aligned vector
        u0 = u1;

        // do work
        // ...
    }

    // handle tail elements
    for (i = end; i < dataLen; ++i) {
        // do work
        // ...
    }
}
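If you want to convince yourself of the loop above without PowerPC hardware, the three intrinsics can be emulated byte by byte in portable C. This is only a sketch: vec16, emu_ld, emu_lvsl, emu_perm and copy_unaligned are made-up names, and it assumes the buffer is padded so that the aligned 16-byte blocks surrounding the data are readable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit vector register. */
typedef struct { unsigned char b[16]; } vec16;

/* Emulates vec_ld: the low four bits of the effective address are ignored. */
static vec16 emu_ld(long offset, const void *base)
{
    uintptr_t ea = ((uintptr_t)base + (uintptr_t)offset) & ~(uintptr_t)15;
    vec16 v;
    memcpy(v.b, (const void *)ea, 16);
    return v;
}

/* Emulates vec_lvsl: mask bytes are sh, sh + 1, ..., sh + 15,
   where sh is the misalignment of the address. */
static vec16 emu_lvsl(long offset, const void *base)
{
    unsigned sh = (unsigned)(((uintptr_t)base + (uintptr_t)offset) & 15);
    vec16 m;
    for (int i = 0; i < 16; ++i)
        m.b[i] = (unsigned char)(sh + (unsigned)i);
    return m;
}

/* Emulates vec_perm: each mask byte indexes the 32-byte concatenation a:b. */
static vec16 emu_perm(vec16 a, vec16 b, vec16 mask)
{
    vec16 r;
    for (int i = 0; i < 16; ++i) {
        unsigned idx = mask.b[i] & 31u;
        r.b[i] = idx < 16 ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}

/* Copies n_vecs unaligned 16-byte blocks from src to dst using one
   emulated load and one emulated permute per iteration. */
static void copy_unaligned(unsigned char *dst, const unsigned char *src, int n_vecs)
{
    vec16 u0 = emu_ld(0, src);
    vec16 perm = emu_lvsl(0, src);   /* computed once, outside the loop */

    assert(n_vecs >= 0);
    for (int i = 0; i < n_vecs; ++i) {
        vec16 u1 = emu_ld(16, src + 16 * i);
        vec16 v = emu_perm(u0, u1, perm);
        memcpy(dst + 16 * i, v.b, 16);
        u0 = u1;                     /* carry the aligned block forward */
    }
}
```

When src happens to be 16-byte aligned, the last emu_ld(16, ...) reads one aligned block beyond the data that is actually needed, so the source buffer needs 16 bytes of slack at the end.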
However, note that I use vec_ld(16, target) instead of vec_ld(15,
target) like Apple recommends. This is because in some cases, the
data that I want to load may actually be aligned and using vec_ld(15,
target) would load the same location as vec_ld(0, target), therefore
the u0 vector wouldn't be updated with the correct data.
If you can guarantee that your data *won't* be aligned then just use
vec_ld(15, target) as Apple does. If your data might be aligned
then use vec_ld(16, target), but be aware that you might run into
problems at the end of the array, where the load can touch memory
that you shouldn't be accessing.
In this case just allocate an extra 16 bytes of data at the end of
the array. This is mentioned specifically in the documentation that
you referenced.
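The difference between the two offsets comes down to how vec_ld forms its address: it adds the offset to the pointer and then rounds down to a 16-byte boundary. A small sketch of that arithmetic in plain C, with vec_ld_address and reloads_same_block as hypothetical helper names for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Address actually read by vec_ld(offset, p): the sum is rounded
   down to a 16-byte boundary. */
static uintptr_t vec_ld_address(uintptr_t p, long offset)
{
    assert(offset >= 0);
    return (p + (uintptr_t)offset) & ~(uintptr_t)15;
}

/* Nonzero if vec_ld(offset, p) reads the same aligned block as
   vec_ld(0, p), in which case u0 would never advance. */
static int reloads_same_block(uintptr_t p, long offset)
{
    return vec_ld_address(p, offset) == vec_ld_address(p, 0);
}
```

For an aligned pointer such as 0x1000, offset 15 lands back on 0x1000 while offset 16 moves on to 0x1010. For a misaligned pointer such as 0x1004, both offsets land on 0x1010, which is why the two forms only differ in the aligned case.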
The above code can be similarly extended when unrolling the loop:
void doWork(float *data, int dataLen)
{
    vector float u0 = vec_ld(0, data);
    vector unsigned char loadPerm = vec_lvsl(0, data);
    int end;
    int i;

    end = dataLen - (dataLen & 15);
    // stop at end, not dataLen, so the tail loop gets the leftovers
    for (i = 0; i < end; i += 16) {
        float *p = data + i;
        vector float u1, u2, u3, u4;
        vector float v1, v2, v3, v4;

        // each unaligned vector is constructed from two aligned vectors
        u1 = vec_ld(16, p);
        u2 = vec_ld(32, p);
        u3 = vec_ld(48, p);
        u4 = vec_ld(64, p);
        v1 = vec_perm(u0, u1, loadPerm);
        v2 = vec_perm(u1, u2, loadPerm);
        v3 = vec_perm(u2, u3, loadPerm);
        v4 = vec_perm(u3, u4, loadPerm);

        // update previous aligned vector
        u0 = u4;

        // do work
        // ...
    }

    // handle tail elements
    for (i = end; i < dataLen; ++i) {
        // do work
        // ...
    }
}
Or is it enough to declare this function as inline, so that the
compiler removes the invariant from the loop itself?
I am interested in the behaviour of both GCC 3.3 and 4.0.
I don't think GCC is currently smart enough for this (since the
memory location in question is a runtime variable) but I might be
wrong. You should just compile the code and see what assembly GCC
spits out. You can look at an assembly listing by right clicking on
the code in Xcode (either in the file browser or the source code
listing) and selecting "Show Assembly Code".
You can also do this from the command line using:
gcc -S sourcecode.c
which will create a sourcecode.s file, or for compiled code:
otool -tV filename
which will disassemble all the code in the executable.
I'm usually much too lazy to do any of those so I do a Shark profile
of the code (make sure you've got debugging symbols turned on).
Double click on the vec_lvsl instruction in the source listing in
Shark and it will take you to its corresponding assembly code.
The people who made Shark should be given goddamn medals. Not just
for that but for a whole bunch of other stuff that's in it.
I would like to maximize performance in the tight loop, but code
readability is also an issue.
Thank you in advance.
If you work out a system for naming the variables and make sure you
don't make any typos then this sort of code shouldn't be too bad.
Otherwise, use preprocessor macros, e.g.:
#define LOAD(x, n) \
x ## n = vec_ld(16 * n, data)
#define PERM(x, a, b) \
x ## b = vec_perm(u ## a, u ## b, loadPerm)
and use:
LOAD(u, 1);
LOAD(u, 2);
LOAD(u, 3);
PERM(v, 0, 1);
PERM(v, 1, 2);
PERM(v, 2, 3);
which should translate to:
u1 = vec_ld(16, data);
u2 = vec_ld(32, data);
u3 = vec_ld(48, data);
v1 = vec_perm(u0, u1, loadPerm);
v2 = vec_perm(u1, u2, loadPerm);
v3 = vec_perm(u2, u3, loadPerm);
etc.
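One way to check what macros like these expand to, without wading through the whole preprocessed file, is to stringify the expansion. A small sketch, assuming the LOAD and PERM definitions above; STR, XSTR, the two string variables and check_perm_expansion are names made up for this example:

```c
#include <assert.h>
#include <string.h>

#define LOAD(x, n) \
    x ## n = vec_ld(16 * n, data)
#define PERM(x, a, b) \
    x ## b = vec_perm(u ## a, u ## b, loadPerm)

/* STR stringifies its argument as written; XSTR lets the argument
   be macro-expanded first and then stringifies the result. */
#define STR(x)  #x
#define XSTR(x) STR(x)

/* Text the preprocessor produces for two of the invocations.
   Nothing here is compiled as AltiVec code; it is only text. */
static const char *const load_u1_text = XSTR(LOAD(u, 1));
static const char *const perm_v1_text = XSTR(PERM(v, 0, 1));

/* Sanity check: the PERM expansion assigns to v1 and calls vec_perm. */
static void check_perm_expansion(void)
{
    assert(strstr(perm_v1_text, "v1") == perm_v1_text);
    assert(strstr(perm_v1_text, "vec_perm") != NULL);
}
```

Printing load_u1_text and perm_v1_text with puts shows the generated statements. Running gcc -E on the source file gives the same information for the whole translation unit.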