Re: MPI on SMP system

Even after you optimize your code to the point where it scales perfectly with the number of CPUs, and even if your communication were perfect, the Apple G5 architecture (not the CPU) was simply not designed for high-end, memory-intensive computations. There is absolutely no reason for a single memory controller and path. Tom's Hardware did a nice review of the AMD Opteron boards, and it is extremely clear that the G5 is hamstrung by both CPUs sharing the memory path and controller. Apple is aware of this. I optimized my code. They looked at it. Everyone in between looked at it. It has nothing to do with OpenMP/MPI. It's all hardware.

I wish Apple built single-CPU XServe G5 cluster nodes. At least you would get 100% of each CPU. Getting 50-60% of the 2nd CPU is not good. Fortunately the G5 is an extremely good processor (for my calculations). Even with the bad memory access design, it's faster than the Opteron and Itanium2, again for my computations. I ordered a 16-CPU XServe G5 anyway.
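
As a quick check on the arithmetic, the "50-60% of the 2nd CPU" figure lines up with the speed-ups quoted in this thread (about 1.5 on the dual G5, about 1.85 on the Opteron). The sketch below only restates that arithmetic; the function names are made up for illustration.

```python
def parallel_efficiency(speedup, n_cpus):
    """Fraction of ideal (linear) scaling actually achieved."""
    return speedup / n_cpus


def extra_cpu_yield(speedup, n_cpus=2):
    """Average fraction of a full CPU contributed by each CPU beyond the first."""
    return (speedup - 1.0) / (n_cpus - 1)


if __name__ == "__main__":
    # Speed-up figures taken from the messages in this thread.
    for label, s in [("dual G5, ~300MB run", 1.5), ("dual Opteron", 1.85)]:
        print(label,
              "efficiency:", round(parallel_efficiency(s, 2), 3),
              "2nd CPU yield:", round(extra_cpu_yield(s), 3))
```

A speed-up of 1.5 on two CPUs means the second CPU contributes half of what a full CPU would, which is the low end of the 50-60% range.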

I don't want to start an A vs. B debate. These opinions are based on my experience with a variety of large-memory codes on a number of platforms, all running some Unix variant (Linux, Irix, OS-X, Unicos, Ultrix).

Unfortunately one cannot compare the G5 bandwidth to SGI boxes. The SGI CPUs aren't great but you'd be hard pressed to find a better motherboard. SGI has always paid attention to the issues involved in multi-processor systems.

I think the main problem is that Apple does not want to develop another motherboard. I believe, but may be wrong, that they have 3 basic motherboards. The G5 and XServe are essentially the same. Hopefully they will decide to get into high-end computing and provide a motherboard that the PPC-970 deserves.


Sean


On Mar 18, 2004, at 7:37 AM, Gaurav Khanna wrote:

Sean

Thanks for your input. I have a follow-up question or two. Would
this issue persist if one had used OpenMP parallelization? I would
think so .. I have not extensively tested that out yet.

Also, benchmarking my codes on an SGI ALTIX (which is an Itanium2
based SMP system) .. yielded near perfect scaling. So, I imagine that
the SGI ALTIX motherboard has a memory controller etc for each
processor or something like that.

Gaurav




===== Original Message From "Sean C. Garrick" <email@hidden> =====
I've done a fair amount of benchmarking with my own MPI codes on
different machines and have noticed the same thing. This is a known issue.
It has to do with the limited memory bandwidth on the Apple G5
motherboard. Each processor has to share the bus and memory controller
which accesses main memory.
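
A rough way to picture the shared-bus limit is a toy model in which every process wants a fixed memory bandwidth and all of them draw from one shared path. This is only a sketch under stated assumptions: the function name and the bandwidth numbers are hypothetical (arbitrary units, picked so that the two-process case reproduces a factor of 1.5), not measurements of any real board.

```python
def bandwidth_limited_speedup(n_procs, demand_per_proc, shared_bandwidth):
    """Speed-up of n_procs identical memory-bound processes on one shared memory path.

    demand_per_proc  -- bandwidth one process would use if it ran alone and unthrottled
    shared_bandwidth -- total bandwidth the single controller/bus can deliver
    Both are in the same arbitrary units; the values used below are hypothetical.
    """
    total_demand = n_procs * demand_per_proc
    if total_demand <= shared_bandwidth:
        # The bus is not saturated, so scaling is linear.
        return float(n_procs)
    # The bus is saturated: aggregate throughput is capped at shared_bandwidth.
    single_rate = min(demand_per_proc, shared_bandwidth)
    return shared_bandwidth / single_rate


if __name__ == "__main__":
    # Hypothetical numbers: each process wants 4 units, the shared path delivers 6.
    print(bandwidth_limited_speedup(1, 4.0, 6.0))  # one process alone
    print(bandwidth_limited_speedup(2, 4.0, 6.0))  # two processes, shared path
    # One controller per processor, modelled as doubling the available bandwidth.
    print(bandwidth_limited_speedup(2, 4.0, 8.0))
```

In this model a small problem that mostly runs out of cache has a low demand and scales linearly, while a large problem pushes the combined demand past what the shared path can deliver.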

This is true on Xeon motherboards, and there the performance is even
worse. It is not true on AMD Opteron motherboards, however. There I get
speed-ups of 1.85 or so. And just so everyone knows, this is executed
using "mpirun -np 1 or 2 executable.out"

I was hoping that Apple would give each processor its own memory
controller on the XServe G5 motherboard but alas it did not happen.

Sean



On Mar 17, 2004, at 9:04 PM, Gaurav Khanna wrote:

Hi all

I have a question on the performance of MPI (MPICH,
in this case) on an SMP (shared memory) system. I have
a code that I MPI-parallelized myself .. and it scales near
perfectly (on small sized distributed clusters). More
explicitly, say I run this code on 2 single processor Macs,
in parallel, I nearly get twice the speed compared with one.

However, if I run the same code on a dual processor
Mac (G4 or G5) by configuring MPI to treat the machine
as 2 computers .. I get much poorer performance (i.e.
gain in speed over single processor). Moreover, the larger
(in terms of memory) simulation I attempt, the worse
the problem gets. On a run using approximately 300MB
of RAM, I'm down to getting a factor of 1.5 speed-up
using a dual processor over a single processor.

I even tried to reconfigure and recompile MPICH for shared
memory communication (using -comm=shared) but no
improvement.

I tried a totally different and unrelated code (that is also
known to scale well) and I'm getting pretty much the same
deal. I even (very briefly though) tried LAM-MPI with no
significant difference.

Am I missing something? Has anyone noticed this as well?
Note that the problem becomes significant only for *large*
simulations .. say 300MB or more. Any advice would be
appreciated. Maybe this is a generic occurrence when you
use MPI on an SMP machine .. instead of OpenMP?
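
One way to see whether the size dependence comes from memory traffic rather than from MPI is to time a plain buffer copy at several working-set sizes, with no message passing involved. The following is a minimal sketch, not a calibrated benchmark: the function names and buffer sizes are illustrative assumptions, and Python adds per-call overhead that a compiled test such as STREAM would not have.

```python
import time


def copy_bandwidth(n_bytes, repeats=5):
    """Best observed copy bandwidth (bytes/second) for a buffer of n_bytes."""
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one pass reading src, one pass writing dst
        elapsed = time.perf_counter() - start
        # Guard against a zero reading from the timer on very small buffers.
        best = min(best, max(elapsed, 1e-9))
    del dst
    # Each copy moves n_bytes in and n_bytes out.
    return 2 * n_bytes / best


def sweep(sizes):
    """Map each buffer size (in bytes) to its measured copy bandwidth."""
    return {size: copy_bandwidth(size) for size in sizes}


if __name__ == "__main__":
    # Illustrative sizes: one likely to fit in cache, two that likely do not.
    for size, bw in sweep([64 * 1024, 8 * 1024 * 1024, 64 * 1024 * 1024]).items():
        print(f"{size / 2**20:8.2f} MiB  {bw / 1e9:6.2f} GB/s")
```

Running one copy alone and then two copies at the same time on a dual-processor machine gives a rough idea of how much bandwidth each process keeps when both compete for main memory.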

I'll try a similar test on an IBM Power4 system (p690) that
I have access to ..

Regards
Gaurav
_______________________________________________
scitech mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/scitech
Do not post admin requests to the list. They will be ignored.
_______________________________________________
clusters mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/clusters
Do not post admin requests to the list. They will be ignored.




