For the edification of the list members, I have examined both
cblas_saxpy and vSaxpy.
Good points of cblas_saxpy: It appears to have code to support the G3
(which has no AltiVec unit), and its inner loop seems to be 6 opcodes
per vector float, instead of vSaxpy's 7.
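
For anyone who has not looked at it, the operation both routines compute
is y = alpha*x + y. Below is a minimal portable C sketch that handles
four floats per loop iteration, to mirror one 4-float AltiVec vector.
The name saxpy_by4 is made up for illustration; it is not a vecLib
function, and this is not the code either library actually runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of vecLib): y[i] = alpha * x[i] + y[i].
 * Assumes n is a multiple of 4, mirroring a 4-float AltiVec vector. */
void saxpy_by4(size_t n, float alpha, const float *x, float *y)
{
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        /* One "vector float" worth of work: four multiply-adds. */
        y[i + 0] += alpha * x[i + 0];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
}
```

With 256 elements this loop body runs 64 times, so a one-opcode
difference per vector float amounts to roughly 64 opcodes per call.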
Good points of vSaxpy: Its per-call overhead appears to be very low,
whereas cblas_saxpy is pretty heavy. cblas_saxpy goes through a dylib
stub, does some input parameter checking to determine whether it should
use AltiVec or not, and eventually calls another function to get the
real work done. vSaxpy just gets right to work, no fuss, no muss. When
you only need to do work on 256 elements, the setup cost adds up.
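
To make the layering concrete, here is a rough C sketch of the kind of
check-then-forward entry point described above. It is only an
illustration under my own assumptions: saxpy_checked and saxpy_kernel
are hypothetical names, the 16-byte alignment and unit-stride tests are
guesses at what such a check might involve, and the "kernel" is a plain
scalar loop standing in for a vector one. Only positive strides are
handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical worker: stands in for a vectorized inner loop. */
static void saxpy_kernel(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Hypothetical generic entry point: validates its inputs, decides
 * whether the fast path applies, then forwards to a second function.
 * Every one of these steps is paid on each call, however small n is. */
void saxpy_checked(int n, float alpha,
                   const float *x, int incx,
                   float *y, int incy)
{
    if (n <= 0)
        return;
    assert(incx > 0 && incy > 0);   /* simplification: no negative strides */

    int fast_path = incx == 1 && incy == 1 &&
                    (uintptr_t)x % 16 == 0 &&
                    (uintptr_t)y % 16 == 0;

    if (fast_path) {
        saxpy_kernel((size_t)n, alpha, x, y);
    } else {
        /* Fallback for strided or unaligned data. */
        for (int i = 0; i < n; i++)
            y[(size_t)i * (size_t)incy] += alpha * x[(size_t)i * (size_t)incx];
    }
}
```

A direct routine in the style of vSaxpy would correspond to calling the
kernel straight away and leaving alignment and stride to the caller,
which is where the saving on short vectors would come from.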