Mailing Lists: Apple Mailing Lists


Re: L2 Cache Miss (was: Floating Point comparison G5)




Hi Ian,

  Thanks once again.

> It looks like we are running on a G4 here?

> Compile with -g if Shark isn't showing which lines of source this loop corresponds to.


Yes, you were correct: I was profiling it on my laptop last night (it's very impressive, by the way, that you picked that up). I've rerun the code and collected the L2 cache miss profile on my G5 here at work with debugging symbols on. Since this is the real-world case, I'd imagine that the output here is more appropriate:


               205 * Copy the current "row" into the work space.
               206       V = V0
               207       IW = IWN
               208
 0.0%   0.0%   209       DO K = 1, M
98.5%  97.8%   210          WORK( IW ) = REAL(DATA( V )) ! Data cache block touch, Unaligned loop start
 0.6%   0.6%   211          WORK( IW + 1 ) = AIMAG(DATA( V ))
               212          V = V + INC
               213          IW = IW + 2
               214       END DO
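
For readers more at home in C, the loop above can be sketched roughly as follows. This is only an illustration under stated assumptions: the type `complex_f32` and the function name `gather_row` are hypothetical, DATA is assumed to be single-precision complex, and indices are zero-based rather than Fortran's one-based.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Fortran single-precision COMPLEX element
   (assumption: DATA in the listing has this type). */
typedef struct { float re, im; } complex_f32;

/* Gather m complex elements from data, starting at index v0 and stepping
   by inc elements, into work as interleaved (re, im) pairs.
   Mirrors lines 209-214 of the listing, with zero-based indexing. */
void gather_row(const complex_f32 *data, size_t v0, size_t inc,
                size_t m, float *work)
{
    assert(data != NULL && work != NULL);
    size_t v = v0;
    for (size_t k = 0; k < m; ++k) {
        work[2 * k]     = data[v].re;  /* WORK( IW )     = REAL(DATA( V ))  */
        work[2 * k + 1] = data[v].im;  /* WORK( IW + 1 ) = AIMAG(DATA( V )) */
        v += inc;                      /* V = V + INC */
    }
}
```

The writes to `work` are contiguous, while the reads from `data` jump by `inc` elements each iteration, which is consistent with the profile attributing nearly all of the misses to line 210, where the strided load happens.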


  And the assembly:

0.0% 0.0% 0x523a8 cmpwi r25,0 3:1 Stall=2, Loop start[23] mFFT.f:209
0x523ac ble $+264 <mFFT_$.clon0 + 4308> 2:1 mFFT.f:209
0x523b0 slwi r3,r23,3 2:1 mFFT.f:209
0x523b4 lwz r4,300(r1) 3:1 Stall=1 mFFT.f:209
0x523b8 lwz r2,284(r1) 3:1 mFFT.f:209
0x523bc add r3,r3,r4 2:1 mFFT.f:209
0x523c0 beq cr4,$+44 <mFFT_$.clon0 + 4108> 2:1 mFFT.f:209
0x523c4 lwz r5,260(r1) 3:1 Stall=1 mFFT.f:209
0x523c8 li r4,260 2:1 mFFT.f:210
0x523cc mtctr r5 3:1 mFFT.f:210
0x523d0 lfsux f0,r3,r22 5:1 ! Stall=3, Loop start[24], Unaligned loop start mFFT.f:210
0x523d4 lfs f1,4(r3) 5:1 Stall=2 mFFT.f:210
0x523d8 stfs f0,4(r2) 4:1 mFFT.f:210
0x523dc dcbt r2,r4 3:1 ! Data cache block touch mFFT.f:210
0x523e0 stfsu f1,8(r2) 4:1 mFFT.f:211
0x523e4 bdnz++ $-20 <mFFT_$.clon0 + 4080> 2:1 Loop end[24] mFFT.f:214
0x523e8 beq cr2,$+204 <mFFT_$.clon0 + 4308> 2:1 mFFT.f:214
0x523ec lfsx f0,r3,r22 5:1 mFFT.f:210
0x523f0 add r3,r3,r22 2:1 mFFT.f:210
0x523f4 lwz r4,248(r1) 3:1 mFFT.f:210
0x523f8 lfs f1,4(r3) 5:1 mFFT.f:210
0.3% 0.3% 0x523fc lfsx f5,r3,r22 5:1 mFFT.f:210
0.0% 0.0% 0x52400 add r3,r3,r22 2:1 mFFT.f:210
0.1% 0.1% 0x52404 mtctr r4 3:1 mFFT.f:210
0x52408 lfs f2,4(r3) 5:1 mFFT.f:210
0x5240c lfsx f3,r3,r22 5:1 mFFT.f:210
0x52410 add r3,r3,r22 2:1 mFFT.f:210
0x52414 bdz-- $+104 <mFFT_$.clon0 + 4252> 2:1 mFFT.f:210
0.1% 0.1% 0x52418 li r6,260 2:1 mFFT.f:210
0x5241c nop 0:0 mFFT.f:210
19.5% 19.4% 0x52420 lfs f4,4(r3) 5:1 Loop start[25] mFFT.f:210
0x52424 lfsx f6,r3,r22 5:1 mFFT.f:210
0.2% 0.2% 0x52428 stfs f5,12(r2) 4:1 mFFT.f:210
0x5242c add r3,r3,r22 2:1 Stall=1 mFFT.f:210
17.8% 17.7% 0x52430 lfs f5,4(r3) 5:1 mFFT.f:210
0x52434 add r4,r3,r22 2:1 mFFT.f:210
0x52438 stfs f0,4(r2) 4:1 mFFT.f:210
0x5243c stfs f3,20(r2) 4:1 mFFT.f:210
0.1% 0.1% 0x52440 stfs f2,16(r2) 4:1 mFFT.f:211
0x52444 lfsx f0,r3,r22 5:1 mFFT.f:210
0.2% 0.2% 0x52448 stfs f1,8(r2) 4:1 mFFT.f:211
0x5244c add r5,r4,r22 2:1 mFFT.f:210
23.0% 22.8% 0x52450 dcbt r2,r6 3:1 ! Data cache block touch mFFT.f:210
0x52454 stfs f4,24(r2) 4:1 mFFT.f:211
0.0% 0.0% 0x52458 stfs f5,32(r2) 4:1 mFFT.f:211
0x5245c lfs f1,4(r4) 5:1 mFFT.f:210
1.2% 1.2% 0x52460 stfs f6,28(r2) 4:1 mFFT.f:210
0x52464 addi r2,r2,32 2:1 mFFT.f:211
0x52468 lfs f2,4(r5) 5:1 mFFT.f:210
0x5246c add r3,r5,r22 2:1 mFFT.f:210
16.0% 15.9% 0x52470 lfsx f3,r5,r22 5:1 mFFT.f:210
0x52474 lfsx f5,r4,r22 5:1 mFFT.f:210
0.0% 0.0% 0x52478 bdnz++ $-88 <mFFT_$.clon0 + 4160> 2:1 Loop end[25] mFFT.f:210
0.1% 0.1% 0x5247c lfs f4,4(r3) 5:1 mFFT.f:210
0.0% 0.0% 0x52480 lfsx f6,r3,r22 5:1 mFFT.f:210
0x52484 stfs f5,12(r2) 4:1 mFFT.f:210
0x52488 add r3,r3,r22 2:1 Stall=1 mFFT.f:210
0x5248c lfs f5,4(r3) 5:1 mFFT.f:210
0.1% 0.1% 0x52490 stfs f2,16(r2) 4:1 mFFT.f:211
0x52494 li r4,260 2:1 mFFT.f:210
0x52498 stfs f3,20(r2) 4:1 mFFT.f:210
0x5249c stfs f0,4(r2) 4:1 mFFT.f:210
0x524a0 stfs f1,8(r2) 4:1 mFFT.f:211
0x524a4 stfs f5,32(r2) 4:1 mFFT.f:211
0x524a8 stfs f4,24(r2) 4:1 mFFT.f:211
0x524ac dcbt r2,r4 3:1 ! Data cache block touch mFFT.f:210
0x524b0 stfs f6,28(r2) 4:1 mFFT.f:210

Note that I compiled with -qprefetch enabled in XLF (although I see similar dcbt instructions without it). This only occurs in the third dimension of the FFT, which presumably involves pulling in the most data. Since vDSP essentially eliminates this problem, please don't spend more than a cursory amount of time looking at it. A similar problem occurs right after this FFT, where some values are being multiplied, etc. (on the FFT-transformed data set). But looking at that code, I think I could vectorize it to compensate. So once I get that in place and profile it, perhaps we could look at that instead, if any issues arise.
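
As a rough way to see why the third dimension would pull in the most data, the sketch below counts how many distinct cache lines a strided walk touches. It is a back-of-the-envelope model, not a measurement: the function name `lines_touched` is hypothetical, the walk is assumed to start on a line boundary, and the 128-byte line size (the G5's cache line size) and the stride in the usage note are illustrative parameters, not values taken from the code above.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct cache lines touched when reading m elements of
   elem_bytes each, with consecutive elements stride_bytes apart.
   Assumes the first element starts on a cache-line boundary. */
size_t lines_touched(size_t m, size_t elem_bytes,
                     size_t stride_bytes, size_t line_bytes)
{
    assert(elem_bytes > 0 && line_bytes > 0 && stride_bytes >= elem_bytes);
    size_t count = 0;
    size_t last_line = (size_t)-1;   /* sentinel: no line seen yet */
    for (size_t k = 0; k < m; ++k) {
        size_t first = (k * stride_bytes) / line_bytes;
        size_t last  = (k * stride_bytes + elem_bytes - 1) / line_bytes;
        /* Addresses only increase, so a line is new iff it differs
           from the most recently counted one. */
        for (size_t line = first; line <= last; ++line) {
            if (line != last_line) {
                ++count;
                last_line = line;
            }
        }
    }
    return count;
}
```

With 8-byte complex elements and 128-byte lines, `lines_touched(1024, 8, 8, 128)` models a contiguous walk and `lines_touched(1024, 8, 2048, 128)` models a walk with a hypothetical stride of 256 elements. In the second case every element lands on its own line, so only 8 of every 128 bytes fetched are actually used.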


Once again, I am very grateful for your help with this. It's been very informative!

  Regards,

Dave


Copyright © 2007 Apple Inc. All rights reserved.