We've been investigating some performance problems with an application
that we're working on and have found that memory performance
(allocations & deallocations of 15 KB or more) is extremely poor in a
multi-threaded environment with one thread per CPU. The problem appears
to be that when the work requires a lot of memory allocation and
deallocation, threads become blocked as a result of Zero Fill VM
Faults.

We use a pool of threads (one thread per CPU) where each thread is
given a unit of work to do. Each unit of work is independent of the
other threads and can be completed without blocking them, i.e. there
should be no resource contention. There is, however, a large amount of
memory usage required to complete each unit of work.
This is where the problem arises. Using Shark we found that a large
amount of time was being spent blocking because of Zero Fill VM
Faults. An All Thread States Time Profile showed an unusually large
amount of time being spent in memcpy while another System Trace
Profile showed that most of the time attributed to memcpy was a result
of Zero Fill VM Faults. This results in a 15-20% performance hit for
our program on the Mac.

On other platforms (Windows & Linux) the same hardware (8 Core Mac
Pro) running our program can fully saturate all 8 CPUs, with top
usage of around 785-790% and almost 0% idle. On the Mac (10.5.2),
however, we can only get about 680% CPU usage with about 14% idle.
The difference appears to be a direct result of blocking caused by Zero
Fill VM Faults.

I've written a small test program that illustrates the problem and
shows that it can actually be far worse than what we are seeing in our
application. Basically, all it does is launch a thread for each CPU
that does the following:

    char * pData;
    while ( true )
    {
        pData = (char *)malloc( nSize );  // allocate nSize bytes
        memset( pData, 5, nSize );        // touch every byte of the block
        free( pData );                    // release it again immediately
    }

It's basically the worst case scenario that you could give to the VM
system. All it does is repeatedly allocate a block of memory, memset
it to make sure it is actually touched and then free it. If nSize is
14 KB (1024 * 14) or less then this little test app will get full CPU
utilization - near 800%. This is because memory allocations within
this size range are done using a heap - the memory is only Zero Filled
once. When memory is returned to the heap and then reallocated it is
not Zero Filled again since this has already been done. The problem
arises when nSize is set to 15 KB or greater. With these larger
allocations, CPU utilization drops from near 800% to around 230%.
Yes, barely more than 2 CPUs can be kept busy when doing lots of memory
allocations of 15 KB or more. This is very bad when trying
to scale performance on machines with lots of CPUs. On a 4 CPU Mac
Pro this app will only get to about 180% in top. So going from 4 to 8
CPUs only takes you from 180% to 230%. That's clearly not
scaling well and not even coming close to reaching each machine's
potential.
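
For anyone who wants to try this, a minimal self-contained sketch of the test described above could look like the following, written in C with pthreads. The names run_alloc_test, worker and worker_args are illustrative only, and the loop is bounded (unlike the while ( true ) loop above) so that a run terminates. memset is called through a volatile function pointer purely so that an optimizing compiler cannot discard the allocate/touch/free sequence.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t size;       /* bytes per allocation (nSize in the snippet above) */
    long iterations;   /* bounded here so the run terminates */
    long completed;    /* iterations this thread actually finished */
} worker_args;

/* Calling memset through a volatile pointer keeps the compiler from
   removing the malloc/memset/free sequence as dead code. */
static void *(*volatile touch)(void *, int, size_t) = memset;

static void *worker(void *p)
{
    worker_args *a = p;
    for (long i = 0; i < a->iterations; i++) {
        char *data = malloc(a->size);
        if (data == NULL)
            break;
        touch(data, 5, a->size);   /* make sure every page is really used */
        free(data);
        a->completed++;
    }
    return NULL;
}

/* Runs nthreads workers and returns the total number of completed
   iterations, or -1 if memory or a thread could not be obtained. */
long run_alloc_test(int nthreads, size_t size, long iterations)
{
    assert(nthreads > 0);
    pthread_t *threads = malloc((size_t)nthreads * sizeof *threads);
    worker_args *args = malloc((size_t)nthreads * sizeof *args);
    long total = -1;

    if (threads != NULL && args != NULL) {
        int started = 0;
        for (; started < nthreads; started++) {
            args[started] = (worker_args){ size, iterations, 0 };
            if (pthread_create(&threads[started], NULL, worker,
                               &args[started]) != 0)
                break;
        }
        total = 0;
        for (int i = 0; i < started; i++) {
            pthread_join(threads[i], NULL);
            total += args[i].completed;
        }
        if (started < nthreads)
            total = -1;            /* not all threads could be created */
    }
    free(threads);
    free(args);
    return total;
}
```

Calling run_alloc_test( 8, 15 * 1024, 1000000 ) from main on an 8 CPU machine and watching top gives a workload comparable to the one described above; the iteration count is arbitrary.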

The problem, according to what I've read, is that large memory
allocations (15 KB & up) are done using vm_allocate, which will always
Zero Fill the new memory. This seems very bad to me. Why always Zero
Fill? Why not Zero Fill only when the actual memory page was last
used by another process? If the pages were last used by the
requesting process then there's no reason to Zero Fill them.

Another interesting observation is that there also
appears to be some throttling going on, or at least that's the best way
I can explain it. If I modify my test app to create only one thread
and then run 8 instances of the application, CPU utilization
increases from 230% to 620%. This suggests that the VM system is
capable of working faster, but is limiting itself when it is stressed
by a single process. It's still not maxing out all 8 CPUs, but it
performs better when several applications are using
memory than when one application with multiple threads is accessing memory.

The only way that I can see to work around this is for us to create
our own memory allocator so that when we free memory we don't
return it to the VM system. That really seems like overkill. Does
anyone else have any suggestions? I'll certainly be filing this as a
bug report, but in the interim it appears that applications
designed to scale well across large numbers of CPUs will not scale
well on the Mac as long as they require frequent memory allocation and
deallocation.
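
As a rough idea of what such an allocator could look like, here is a minimal sketch in C, assuming each thread owns its own cache so that no locking is needed. Freed blocks are kept in a small array and handed back out on the next request instead of being returned to malloc. All names (block_cache, cache_alloc, cache_free, cache_destroy, CACHE_SLOTS) are hypothetical, and the sketch is untested against the problem described here, so it is not known whether it removes the Zero Fill faults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 8              /* hypothetical cap on cached blocks */

/* Header stored in front of every block; the union keeps the user
   pointer suitably aligned (max_align_t requires C11). */
typedef union {
    size_t size;                   /* usable bytes after the header */
    max_align_t align;
} block_header;

typedef struct {
    block_header *slots[CACHE_SLOTS];
    int count;                     /* number of cached blocks */
} block_cache;

/* Returns a block of at least size bytes, reusing a cached block
   when one is large enough and falling back to malloc otherwise. */
void *cache_alloc(block_cache *c, size_t size)
{
    assert(c != NULL);
    for (int i = 0; i < c->count; i++) {
        if (c->slots[i]->size >= size) {
            block_header *h = c->slots[i];
            c->slots[i] = c->slots[--c->count];   /* fill the gap */
            return h + 1;
        }
    }
    block_header *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->size = size;
    return h + 1;
}

/* Keeps the block for reuse; only frees it when the cache is full. */
void cache_free(block_cache *c, void *p)
{
    assert(c != NULL);
    if (p == NULL)
        return;
    block_header *h = (block_header *)p - 1;
    if (c->count < CACHE_SLOTS)
        c->slots[c->count++] = h;
    else
        free(h);
}

/* Releases everything still held, e.g. when the thread exits. */
void cache_destroy(block_cache *c)
{
    assert(c != NULL);
    while (c->count > 0)
        free(c->slots[--c->count]);
}
```

Each worker thread would keep its own zero-initialized block_cache and call cache_alloc / cache_free where it currently calls malloc / free. The trade-off is that cached blocks stay resident until cache_destroy is called.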

Thanks!

--
Dave Thorup
Software Engineer
http://bibblelabs.com