Mailing Lists: Apple Mailing Lists


Re: Hotspot in Accelerate Framework




On Jan 2, 2005, at 12:04 PM, email@hidden wrote:


Interesting ... ATL_srot_xp0yp0aXbX is the ATLAS BLAS fallback case for non-unit stride along X or Y. This case is *not* aggressively optimized because it is almost certain to be *memory* bound. Shark shows nearly 25% of the total run time consumed by loading operands from memory in this routine. How are you calling sgells_? (I'm trying to understand where the non-unit strides are introduced.)
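
For readers unfamiliar with the term, here is one common way a non-unit stride can arise. It is only an illustration, not a claim about what LAPACK does internally in this case. In a column-major (Fortran-order) matrix, the elements of a row are lda floats apart, so rotating two rows means stepping by lda rather than by 1. The names rot_strided, rotate_rows and rotate_cols are hypothetical helpers written for this sketch; they are not the ATLAS or Accelerate code.

```c
#include <assert.h>

/* Hypothetical stand-in for the BLAS srot operation (not the ATLAS source):
   rotates n element pairs, stepping x by incx floats and y by incy floats.
   Only positive strides are handled in this sketch. */
void rot_strided(int n, float *x, int incx, float *y, int incy,
                 float c, float s)
{
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; i++) {
        float xv = x[i * incx];
        float yv = y[i * incy];
        x[i * incx] = c * xv + s * yv;
        y[i * incy] = c * yv - s * xv;
    }
}

/* Rotate rows r0 and r1 of a column-major matrix with leading dimension lda.
   Element (i, j) lives at a[i + j*lda], so walking along a row means
   stepping lda floats at a time: a non-unit stride. */
void rotate_rows(float *a, int lda, int ncols, int r0, int r1,
                 float c, float s)
{
    rot_strided(ncols, a + r0, lda, a + r1, lda, c, s);
}

/* Rotate columns c0 and c1 instead: elements within a column are
   contiguous, so the stride is 1. */
void rotate_cols(float *a, int lda, int nrows, int c0, int c1,
                 float c, float s)
{
    rot_strided(nrows, a + c0 * lda, 1, a + c1 * lda, 1, c, s);
}
```

With a large lda, each element touched by rotate_rows is likely to sit on a different cache line, which is consistent with the routine being memory bound rather than compute bound.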

SCP
--
Steve Peters
CoreOS Performance Engineering
Apple Computer
email@hidden


I've done a little shark profiling on my application. The main core of
it is using LAPACK in the accelerate framework to solve a large set of
simultaneous equations with sgells_. There appears to be one function
(libBLAS.dylib ATL_srot_xp0yp0aXbX) which uses 43% of my run time. I
took a quick look at it and it seems to be a loop which calculates
a*b-c*d and a*d+c*b where a and c are constants over the loop, and
appears very stall heavy on my machine (a G4+, looks mainly like data
dependencies). Switch Shark to a G5 and it appears to get worse (bigger
stalls, which probably negate the two fp-ops per dispatch group bonus).

Is there anything I can do about this? (probably not because it's in a
framework)
It looks like it could be unrolled reasonably easily. Could I write a
new version and then override the dynamic linker to point at my new
version?
Is there any way of knowing if my code (actually, the LAPACK code) is
calling this with large loop counts or just very often? (to know if
unrolling would be worthwhile)

Thanks

Paul

mr. r3,r3 1:1 ! Inline
beqlr 1:1
mtctr r3 *2:2
0.0% slwi r7,r7,2 1:1
slwi r5,r5,2 1:1
2.3% lfs f13,0(r4) 4:1 ! Stall=2, Unaligned loop start
22.5% lfs f0,0(r6) 4:1 Stall=2
6.0% fmuls f12,f2,f13 5:1 Stall=3
6.3% fmuls f11,f2,f0 5:1 Stall=3
1.6% fmsubs f0,f1,f0,f12 5:1 Stall=3
2.0% fmadds f13,f1,f13,f11 5:1 Stall=2
1.8% stfs f0,0(r6) 3:3
add r6,r6,r7 1:1
0.6% stfs f13,0(r4) 3:3
add r4,r4,r5 1:1
bdnz $-40 1:1 Loop end[1]
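
A C rendering of the loop above may make the unrolling question easier to discuss. This is a sketch based on reading the listing, not the ATLAS source: it assumes f1 holds the cosine c, f2 holds the sine s, r4/r6 point at x/y, and r5/r7 are the element strides (converted to bytes by the slwi instructions). The function names rot_simple and rot_unrolled2 are hypothetical. The sketch also assumes positive strides and that x and y do not overlap. Note that the PowerPC code uses fused multiply-adds, so rounding can differ slightly from plain C arithmetic.

```c
/* One rotation per iteration, matching the structure of the listing:
   two loads, two multiplies, two multiply-adds, two stores. */
void rot_simple(int n, float *x, int incx, float *y, int incy,
                float c, float s)
{
    for (int i = 0; i < n; i++) {
        float xv = x[i * incx];
        float yv = y[i * incy];
        y[i * incy] = c * yv - s * xv;   /* fmuls + fmsubs */
        x[i * incx] = c * xv + s * yv;   /* fmuls + fmadds */
    }
}

/* Unrolled by two: the four loads are issued before any arithmetic, giving
   two independent dependency chains that can overlap in the FP pipeline. */
void rot_unrolled2(int n, float *x, int incx, float *y, int incy,
                   float c, float s)
{
    int i = 0;
    for (; i + 1 < n; i += 2) {
        float x0 = x[i * incx],       y0 = y[i * incy];
        float x1 = x[(i + 1) * incx], y1 = y[(i + 1) * incy];
        y[i * incy]       = c * y0 - s * x0;
        x[i * incx]       = c * x0 + s * y0;
        y[(i + 1) * incy] = c * y1 - s * x1;
        x[(i + 1) * incx] = c * x1 + s * y1;
    }
    if (i < n) {                      /* leftover element when n is odd */
        float xv = x[i * incx];
        float yv = y[i * incy];
        y[i * incy] = c * yv - s * xv;
        x[i * incx] = c * xv + s * yv;
    }
}
```

Unrolling of this kind addresses the data-dependency stalls on the multiplies and multiply-adds. It does not reduce the number of loads, so if the time is dominated by memory traffic on the strided loads, the gain may be small; that would need to be measured.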


------------------------------

_______________________________________________
PerfOptimization-dev mailing list
email@hidden
http://lists.apple.com/mailman/listinfo/perfoptimization-dev






