On Tue, Sep 14, 2004 at 06:40:32PM +0300, Kyros Yakinthos scratched on
the wall:
> Is it finally worth to program using AltiVec in a FORTRAN code by
> calling C subroutines?
As a professional software engineer, my answer would be "no," but
there is a lot in that answer-- I might not even be answering the
question you're asking.
If you are asking about using existing libraries or frameworks, such
as Apple's Accelerate framework (which contains vector optimized
versions of vDSP, vImage, BLAS, LAPACK, vMathLib, and vBigNum),
then I would say, "yes!!!" If any of Apple's libraries does what you
need, it is likely worth the trouble to stub the libs out in C to
FORTRAN and/or link against them. They have both vectorized and
non-vectorized versions of most of the calls, so they'll run on G3s
as well as G4s and G5s as required. You really don't need to think
or care whether the call you are making is vectorized-- you can just
be reasonably confident that the library will get whatever you want
done as fast as it can given the current processor and datatypes--
including future hardware. No need to re-invent the wheel.
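To give a rough idea of what "stubbing the libs out in C" looks like, here is a minimal sketch of a FORTRAN-callable wrapper around one Accelerate routine. The stub name `vaddstub_` is made up for illustration, and the trailing underscore and pass-by-reference arguments are assumptions about how a typical FORTRAN compiler calls C; check your own compiler's documentation. On a Mac the body hands the work to vDSP_vadd; anywhere else it falls back to a plain loop so the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>   /* link with: -framework Accelerate */
#endif

/* Hypothetical stub, intended to be reachable from FORTRAN as
 *     CALL VADDSTUB(A, B, C, N)
 * It computes C(i) = A(i) + B(i) for i = 1..N.
 * FORTRAN passes every argument by reference, hence the pointer to n. */
void vaddstub_(const float *a, const float *b, float *c, const int *n)
{
    assert(a && b && c && n && *n >= 0);
#ifdef __APPLE__
    /* Let the library decide whether to use the vector unit. */
    vDSP_vadd(a, 1, b, 1, c, 1, (vDSP_Length)*n);
#else
    /* Portable fallback so the example compiles off the Mac. */
    for (int i = 0; i < *n; i++)
        c[i] = a[i] + b[i];
#endif
}
```

The FORTRAN side never needs to know which path was taken, which is the whole point of leaning on the library.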
If you are looking at auto-vectorization tools, I would say a much
less enthusiastic "yes," or even just a "maybe." If the tools aren't
real expensive, they are worth a shot, but you should
understand what you have (or don't have) to gain so you can look at
the cost and the result and see what works for you. For many, these
tools are a gift from the heavens; for others they only offer
disappointment.
On the other hand, if you are asking about hand-coding operations on
the vector unit (I assume you are), I would seriously question this
practice. Vector programming is very tricky. It is not a simple
"array processor"... how you pack your data into vector units is very
critical to performance and you have to understand a lot of the low
level details to wrap your mind around how the vector unit was
designed to be used. Even if you're writing in C or FORTRAN you need
to *think* in individual assembly instructions; that also means
knowing your tools and systems well enough to know how and why specific
program statements are compiled into machine instructions. Doing
this kind of thing well is extremely difficult, just as hand-coding
instructions for the G5's dual FPUs would be extremely difficult. I
would never attempt it without the PowerPC-970 Instruction Reference
Manual on the desk next to me. If you've never looked at a processor
reference manual, save yourself and don't start. Most CS undergrads
have never looked at one (although most CE or EE undergrads have!).
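For a taste of why this is not a simple "array processor," here is a minimal sketch of about the easiest case there is, adding two float arrays. The function name `add_floats` is hypothetical, and the code is an untuned illustration rather than how anyone should ship it. It uses only the vec_ld, vec_add, and vec_st intrinsics from altivec.h, and compiles to the plain loop on machines without AltiVec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* out[i] = a[i] + b[i] for i in [0, n) */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    assert(a && b && out);
#ifdef __ALTIVEC__
    /* vec_ld and vec_st ignore the low four address bits, so the fast
     * path is only safe when all three arrays are 16-byte aligned. */
    if ((((uintptr_t)a | (uintptr_t)b | (uintptr_t)out) & 15u) == 0) {
        /* Each 128-bit register holds four single-precision floats. */
        for (; i + 4 <= n; i += 4) {
            vector float va = vec_ld(0, a + i);
            vector float vb = vec_ld(0, b + i);
            vec_st(vec_add(va, vb), 0, out + i);
        }
    }
#endif
    /* Scalar loop: handles the leftover elements, unaligned input,
     * and every machine that has no vector unit at all. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Even here, most of the code is about alignment and leftovers rather than the arithmetic, and nothing has yet been done about misaligned data, scheduling, or cache behavior.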
For the bench scientist, researcher, and/or engineer, this kind of
very low-level mucking about is very rarely worth the effort. I
assume most of the people on this list are scientists or engineers
first and computer programmers second. This is a good thing
(actually, it isn't. I'd rather you guys were computer programmers
"tenth" or some larger number, but that's a different story). The
computer is simply a means to an end, not an obsession in itself.
Spend your time doing good research, not fighting compilers.
Faster code may lead to faster and better research, but consider
this. The ideal vector code will, at best, give you 2x the performance
over the ideal non-vector code on the G5 (assuming single-precision
floating point; double-precision can't be vectorized; best-case
integer performance may be higher). One could also make a strong
argument that it is easier to write "good" non-vector code than it is
to write "good" vector code, effectively making that 2x even smaller.
If all you want is 2x performance, go buy another machine. It is
likely to be much cheaper than the people-time to make the code
faster by hand-vectorizing it. Even if that requires rewriting
sections to allow distributed computing, that is time better
spent. At least distributed versions typically scale
past two.
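To put a number on how quickly that 2x shrinks, here is a small sketch using Amdahl's law, which is my choice of model for the arithmetic and assumes only part of the run time sits in loops that can be vectorized. The function name `overall_speedup` and its parameters are made up for illustration.

```c
#include <assert.h>

/* Overall speedup when a fraction of the run time (0.0 to 1.0) is made
 * `factor` times faster and the rest is left untouched (Amdahl's law). */
double overall_speedup(double vector_fraction, double factor)
{
    assert(vector_fraction >= 0.0 && vector_fraction <= 1.0);
    assert(factor > 0.0);
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / factor);
}
```

If half of the run time is in vectorizable loops, overall_speedup(0.5, 2.0) comes out to about 1.33, and you only see the full 2x when the entire program vectorizes perfectly.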
OK, I'll admit that "buy more machines" isn't an option if you
already have a 1000-node cluster, since the price of another 1000
machines will pay for a *lot* of programming time (I'll trade you!).
On those kinds of scales, it is an individual call.
Everyone's situation is different, and there are times when
cost/performance is outweighed by raw performance. Just understand
the high costs of this kind of work, and the rather slim results even
if you do a great job. That said, there is no reason not to take
advantage of it if you can-- the Accelerate libs from Apple make that
easy and can reduce a lot of other programming work. They'd be
highly desirable even if they weren't vectorized. Throw in the IBM
compilers, which are fairly inexpensive next to the programming time
they can save, and you're fairly well off. But tweaking the vector
pipeline by hand is high wizardry.
-j