Mailing Lists: Apple Mailing Lists

Should I manually "unfold" the LoadUnaligned function?



Hello all.
I am using the LoadUnaligned function described here:
http://developer.apple.com/hardware/ve/alignment.html
The function is pretty simple:

static vector unsigned char LoadUnaligned( unsigned char *target )
{
  vector unsigned char MSQ, LSQ, result;
  vector unsigned char mask;

  MSQ = vec_ld(0, target); // most significant quadword
  LSQ = vec_ld(15, target); // least significant quadword

  mask = vec_lvsl(0, target); // create the permute mask
  return vec_perm(MSQ, LSQ, mask); // align the data
}

When iterating through the data in 128-bit chunks, the third instruction in the function (vec_lvsl) looks loop-invariant to me. It always creates the same permute mask when the loads happen at offsets of 16*n bytes from the starting address.


Should I manually "unfold" this function in the loop, i.e. create the permute mask
mask = vec_lvsl(0, target);
before the loop starts, and execute only three instructions inside the loop:
  MSQ = vec_ld(0, target);
  LSQ = vec_ld(15, target);
  result = vec_perm(MSQ, LSQ, mask);

Or is it enough to declare this function inline and let the compiler hoist the invariant out of the loop by itself?
I am interested in the behaviour of both gcc 3.3 and 4.0.
I would like to maximize performance in the tight loop, but code readability is also an issue.
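
To make the manual "unfolding" asked about above concrete, here is a rough sketch of it as a portable scalar model. It is plain C, not AltiVec code: vec16, model_ld, model_lvsl, model_perm and copy_unaligned_hoisted are hypothetical names made up for this illustration, and memory is modelled as a byte array whose index 0 is assumed to sit on a 16-byte boundary. It only shows why the mask depends on nothing but the low 4 bits of the address. It says nothing about what gcc 3.3 or 4.0 actually does with the inlined function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for "vector unsigned char". */
typedef struct { uint8_t b[16]; } vec16;

/* Model of vec_ld: the low 4 bits of the effective address are ignored.
   "addr" is an index into mem; mem[0] is assumed to be 16-byte aligned. */
static vec16 model_ld(size_t offset, const uint8_t *mem, size_t addr)
{
    vec16 v;
    memcpy(v.b, mem + ((addr + offset) & ~(size_t)15), 16);
    return v;
}

/* Model of vec_lvsl: bytes sh, sh+1, ..., sh+15, where sh = address & 15. */
static vec16 model_lvsl(size_t offset, size_t addr)
{
    vec16 v;
    size_t sh = (addr + offset) & 15;
    for (int i = 0; i < 16; i++)
        v.b[i] = (uint8_t)(sh + i);
    return v;
}

/* Model of vec_perm: the low 5 bits of each mask byte select one byte
   from the 32-byte concatenation of a and b. */
static vec16 model_perm(vec16 a, vec16 b, vec16 mask)
{
    vec16 r;
    for (int i = 0; i < 16; i++) {
        unsigned idx = mask.b[i] & 31u;
        r.b[i] = (idx < 16) ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}

/* Copy n unaligned 16-byte chunks starting at mem[addr] into dst.
   The mask is computed once, before the loop. mem must extend to the
   end of the last aligned 16-byte block touched by the loads. */
static void copy_unaligned_hoisted(uint8_t *dst, const uint8_t *mem,
                                   size_t addr, size_t n)
{
    assert(dst != NULL && mem != NULL);
    vec16 mask = model_lvsl(0, addr);          /* hoisted invariant */
    for (size_t i = 0; i < n; i++) {
        size_t a = addr + 16 * i;              /* (a & 15) == (addr & 15) */
        vec16 msq = model_ld(0, mem, a);       /* most significant quadword */
        vec16 lsq = model_ld(15, mem, a);      /* least significant quadword */
        vec16 r = model_perm(msq, lsq, mask);  /* align the data */
        memcpy(dst + 16 * i, r.b, 16);
    }
}
```

In this model, stepping the address by 16 never changes (address & 15), so model_lvsl returns the same mask on every iteration, which is the invariant in question.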
Thank you in advance.


--
Sincerely,
	Rustam Muginov

PerfOptimization-dev mailing list      (email@hidden)

Copyright © 2007 Apple Inc. All rights reserved.