Mailing Lists: Apple Mailing Lists


Re: VBO not faster than immediate mode?



I tried Time Profile (All Thread States) and the results are quite different:

31.0% 31.0% libSystem.B.dylib __semwait_signal
 0.0% 31.0% libSystem.B.dylib _pthread_cond_wait
   0.0% 31.0% libSystem.B.dylib  pthread_cond_wait$UNIX2003

 0.0% 15.5% libGLProgrammability.dylib   glvmDoWork
 0.0% 15.5% libSystem.B.dylib    _pthread_start
 0.0% 15.5% libSystem.B.dylib     thread_start

 0.0% 15.5% QuartzCore   fe_fragment_thread
 0.0% 15.5% libSystem.B.dylib    _pthread_start
 0.0% 15.5% libSystem.B.dylib     thread_start

28.3% 28.3% libSystem.B.dylib mach_msg_trap
0.0% 28.3% libSystem.B.dylib mach_msg
0.0% 15.5% OpenGL  glcDebugListener
0.0% 7.7% CoreFoundation  CFRunLoopRunSpecific
0.0% 2.0% CoreGraphics  _CGSGetPortStreamInline
0.0% 1.3% libSystem.B.dylib  mach_port_get_set_status
20.4% 20.4% iParticle updateSimulation
11.1% 11.1% libSystem.B.dylib thread_start
2.8% 2.8% libSystem.B.dylib pow$fenv_access_off
2.3% 2.3% libSystem.B.dylib __bsdthread_create

This is with VBOs and 200k particles, at approx. 45-60 fps.

Well, a few things are messy and I am aware of it. I have a timer for redraws. After each redraw (= update and draw the VBO buffers) I simply spawn 4 threads for calculations, hoping they finish before the next redraw. That was a quick hack, but everything seems to work without garbage on screen.

Anyway, it seems that I am also wasting 11% on spawning those threads (which I suspected), so I need to design a somewhat better system for them with my limited (read: non-existent) threading knowledge.

So what else should I read from the Shark results? 
There seems to be a whopping 60% of time where something is blocking. Probably both the VBO update and my kludged threading model?

AFAIK, glcDebugListener exists only when you're attached to the app with OpenGL Profiler, to service requests from it.  You ideally want it to show up as blocking, as you see above, 100% of the time; anything less is the overhead of the profiling.

fe_fragment_thread and glvmDoWork are Quartz and GL threads, whose purposes I'm not entirely sure of since I'm not familiar with them, but I believe they're basically worker threads that process work as it's given to them. That they're spending all their time blocking in pthread_cond_wait most likely just indicates they're idle, so they're not the bottleneck, though it may indicate you're underutilising the system - you'll have to await a response from someone who knows these more intimately as to what you should be aiming to see from them.

Thread spawning for each frame, if this is intended to have anything approaching real-time framerates, is very very expensive, as you've noted.  What you should do, as I'm sure you know, is spawn your N worker threads once (you can get the actual number of active cores on the machine using sysctl and hw.logicalcpu) and simply have them wait on a condition to do work.  That's a first step.  A better approach would be to decouple them further from your rendering by having them run continuously, servicing work from a central queue and simply periodically "dropping off" the results for the main thread to render from.
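
To make that first step concrete, here is a rough sketch of workers that are spawned once and then wait on a condition variable for each frame's work. None of this is from the original app: pool_t, pool_run_frame, advance_slice and the rest are made-up names, the per-worker slicing is just a placeholder for the real simulation update, and error cleanup is left out. The sysconf fallback is only there so the sketch also builds off the Mac.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#ifdef __APPLE__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

typedef void (*work_fn)(int worker, int nworkers, void *ctx);
typedef struct pool pool_t;
typedef struct { pool_t *pool; int index; } worker_arg_t;

struct pool {
    pthread_mutex_t lock;
    pthread_cond_t start;   /* broadcast when a new frame's work is ready */
    pthread_cond_t done;    /* signalled when the last worker finishes */
    int frame, pending, quit, nworkers;
    pthread_t *threads;
    worker_arg_t *args;
    work_fn work;           /* hypothetical per-frame callback */
    void *ctx;
};

/* Logical CPU count: hw.logicalcpu on Mac OS X, sysconf elsewhere. */
int logical_cpu_count(void) {
#ifdef __APPLE__
    int n = 0;
    size_t len = sizeof n;
    if (sysctlbyname("hw.logicalcpu", &n, &len, NULL, 0) == 0 && n > 0)
        return n;
#endif
    long m = sysconf(_SC_NPROCESSORS_ONLN);
    return m > 0 ? (int)m : 1;
}

static void *worker_main(void *p) {
    worker_arg_t *arg = p;
    pool_t *pool = arg->pool;
    int seen = 0;                       /* last frame this worker handled */
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->frame == seen && !pool->quit)
            pthread_cond_wait(&pool->start, &pool->lock);   /* idle here */
        if (pool->quit) { pthread_mutex_unlock(&pool->lock); return NULL; }
        seen = pool->frame;
        pthread_mutex_unlock(&pool->lock);

        pool->work(arg->index, pool->nworkers, pool->ctx);  /* outside lock */

        pthread_mutex_lock(&pool->lock);
        if (--pool->pending == 0)
            pthread_cond_signal(&pool->done);
        pthread_mutex_unlock(&pool->lock);
    }
}

/* Spawn the workers once, at startup. Returns 0 on success. */
int pool_create(pool_t *pool, int nworkers, work_fn work, void *ctx) {
    assert(nworkers > 0 && work != NULL);
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->start, NULL);
    pthread_cond_init(&pool->done, NULL);
    pool->frame = pool->pending = pool->quit = 0;
    pool->nworkers = nworkers;
    pool->work = work;
    pool->ctx = ctx;
    pool->threads = malloc(sizeof *pool->threads * nworkers);
    pool->args = malloc(sizeof *pool->args * nworkers);
    if (!pool->threads || !pool->args) return -1;
    for (int i = 0; i < nworkers; i++) {
        pool->args[i].pool = pool;
        pool->args[i].index = i;
        if (pthread_create(&pool->threads[i], NULL, worker_main, &pool->args[i]))
            return -1;
    }
    return 0;
}

/* Wake every worker for one frame and block until all have finished. */
void pool_run_frame(pool_t *pool) {
    pthread_mutex_lock(&pool->lock);
    pool->pending = pool->nworkers;
    pool->frame++;
    pthread_cond_broadcast(&pool->start);
    while (pool->pending > 0)
        pthread_cond_wait(&pool->done, &pool->lock);
    pthread_mutex_unlock(&pool->lock);
}

void pool_destroy(pool_t *pool) {
    pthread_mutex_lock(&pool->lock);
    pool->quit = 1;
    pthread_cond_broadcast(&pool->start);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->nworkers; i++)
        pthread_join(pool->threads[i], NULL);
    free(pool->threads);
    free(pool->args);
}

/* Placeholder work: each worker advances its own slice of the array. */
typedef struct { float *pos; int count; } sim_t;

void advance_slice(int worker, int nworkers, void *ctx) {
    sim_t *sim = ctx;
    int begin = (int)((long)sim->count * worker / nworkers);
    int end = (int)((long)sim->count * (worker + 1) / nworkers);
    for (int i = begin; i < end; i++)
        sim->pos[i] += 1.0f;
}
```

You'd call pool_create once at launch with something like logical_cpu_count() workers, then pool_run_frame from the redraw timer. Note that pool_run_frame blocks the caller until the frame's work is done, which is the simple version; the decoupled "run continuously and drop off results" design would not wait there.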

The central queue, rather than fixed, preallocated divisions between threads, is there to counteract any imbalance in work between the threads; it's unlikely that all your worker threads will do exactly the same amount of work, and even then still unlikely they'll finish together (given CPU time taken by other apps, the semi-random effects of L2 eviction, etc).  Even if you just do the first step of not respawning your threads each frame, you should look into how balanced they are - System Trace (in Shark) is a good way to see that.  It'll also help you with locking bottlenecks you'll likely hit when you rejig your worker threads.
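
As a sketch of what I mean by a central queue: instead of handing each thread a fixed quarter of the particles, keep one shared "next unclaimed index" and let each worker grab a small chunk at a time until nothing is left. Threads that get slowed down simply claim fewer chunks. All names here (queue_t, queue_take, process_all) are invented for illustration, the doubling loop stands in for the real particle update, and for brevity the threads are created and joined inside one call, where in the real app they would be the long-lived workers.

```c
#include <assert.h>
#include <pthread.h>

enum { MAX_THREADS = 16 };

typedef struct {
    pthread_mutex_t lock;
    int next;       /* index of the first unclaimed item */
    int count;      /* total number of items */
    int chunk;      /* items handed out per claim */
    float *data;
} queue_t;

typedef struct { queue_t *queue; int chunks; } worker_t;

/* Claim the next chunk; returns 0 once the queue is empty. */
static int queue_take(queue_t *q, int *begin, int *end) {
    pthread_mutex_lock(&q->lock);
    *begin = q->next;
    *end = q->next + q->chunk;
    if (*end > q->count)
        *end = q->count;
    q->next = *end;
    pthread_mutex_unlock(&q->lock);
    return *begin < *end;
}

static void *queue_worker(void *p) {
    worker_t *w = p;
    int begin, end;
    while (queue_take(w->queue, &begin, &end)) {
        for (int i = begin; i < end; i++)
            w->queue->data[i] *= 2.0f;   /* stand-in for the real update */
        w->chunks++;                     /* how much this thread took on */
    }
    return NULL;
}

/* Process data[0..count) with nthreads workers pulling from one queue.
   chunks_done[i] receives the number of chunks worker i claimed.
   Returns 0 if every thread could be started, -1 otherwise. */
int process_all(float *data, int count, int chunk, int nthreads,
                int *chunks_done) {
    assert(chunk > 0 && nthreads > 0 && nthreads <= MAX_THREADS);
    queue_t q;
    worker_t workers[MAX_THREADS];
    pthread_t threads[MAX_THREADS];
    int started = 0;

    pthread_mutex_init(&q.lock, NULL);
    q.next = 0;
    q.count = count;
    q.chunk = chunk;
    q.data = data;

    for (; started < nthreads; started++) {
        workers[started].queue = &q;
        workers[started].chunks = 0;
        if (pthread_create(&threads[started], NULL, queue_worker,
                           &workers[started]) != 0)
            break;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        chunks_done[i] = workers[i].chunks;
    }
    pthread_mutex_destroy(&q.lock);
    return started == nthreads ? 0 : -1;
}
```

The chunk size is the knob to play with: too small and the threads fight over the lock (which is exactly the sort of contention System Trace will show you), too large and you're back to fixed divisions. The per-thread chunk counts are a cheap way to see how uneven the load really is.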

[[ Also, if you expect your main thread to be busy a lot of the time, e.g. drawing, then test out spawning N-1 workers instead; there's a trade-off between CPU utilisation re. throughput and contention of having N+1 active threads, and when you consider that on any given system there's usually other stuff going on at least a little, it often turns out better to slightly underload the system with your own app. ]]

In general, it can be tricky sometimes to figure out what blocking is good and what blocking is bad.  Generally assume blocking is good if you see it on threads other than the main one, that you don't know any better about.  For your own threads, you should have a reasonable idea what their intended behaviour is.  Blocking from runloops is almost always "fine" in that it indicates you're sitting idle most of the time, which is the typical ideal pattern for runloops.  Blocking in locking calls, that aren't pthread_cond_wait, is usually bad; it indicates contention on the locks that's causing you to block and thus waste time.  Again, use System Trace if you see that sort of thing to see what's happening in greater detail.  And lastly, blocking on the main thread in any GUI app is almost always bad, because if nothing else it's likely to translate into a poor user experience by making your app less responsive.

For programs with many threads it's often tricky to look at them all at once in TP (ATS); personally I find it easier much of the time to look at them thread by thread, mainly just my threads since in most apps the system libraries spawn umpteen of their own, which you needn't care about.  Doing that, what you want to see in your particular app is that your worker threads are doing productive work as much of the time as possible, and that your main thread is as close to idle as possible, with what time it isn't idle being spent in drawing code (since ideally your drawing, which is ultimately all "overhead" in a productivity sense, is infinitely fast).

Also, on a general OpenGL note, in my experience - which I thought was due to some fault of my own, but the 6800 benchmarks posted by Michael mimic my own findings - VBOs are very very difficult to actually get good performance out of, let alone this uber-performance people always claim for them.  They demand very specific data types (which afaik isn't documented, so I hope you enjoy trial and error), they hate being modified after being created, and they don't scale - too many individual VBOs and performance dives, and if they get too large your app crashes.  Out of immediate mode, display lists, vertex arrays and VBOs, I find vertex arrays to have the best performance for the kind of drawing you're doing, pretty much across the board.  [[ I've never tested VARs, though ]]

Also - last time, I swear ;) - as someone else mentioned, you should check out what your app is doing using OpenGL Driver Monitor, to see if there's an obvious bottleneck in your usage - e.g. too much data being sent across the bus.  Since I'm not an OpenGL guy so much, I find it even easier to just do a Processor Bandwidth profile in Shark, which'll just generally show if you're bandwidth limited, including of course the memory bandwidth your simulation processing is using.

Wade
 _______________________________________________
Mac-opengl mailing list      (email@hidden)

References: 
 >VBO not faster than immediate mode? (From: Mike <email@hidden>)
 >Re: VBO not faster than immediate mode? (From: "Bob Free" <email@hidden>)
 >Re: VBO not faster than immediate mode? (From: Mike <email@hidden>)
 >Re: VBO not faster than immediate mode? (From: Mike <email@hidden>)




Copyright © 2007 Apple Inc. All rights reserved.