Re: 64-bit problem with VM statistics
William Kucharski writes:
> On Jun 5, 2005, at 9:26 AM, Andrew Gallatin wrote:
>> This is not really true. Many PCI devices these days are capable of addressing 64 bits of physical memory (e.g., via dual address cycles, "DAC"). The DART enables address-limited devices to address memory anywhere, at the price of artificially crippling DAC-capable devices because of an oversight on IBM's or Apple's part.
> Though due to the PCI bus's 32-bit address width, DAC needs two address cycles to present the full 64-bit address, so depending on locality it is not necessarily any faster for a DAC device to address memory than it is to go through the DART or an IOMMU. One can easily craft scenarios in which either method of access comes out faster.
The time for an address cycle is of no consequence. I can show you DMA benchmarks comparing Alpha, IA-64 and Opteron (all DAC) vs. the G5 (DART) if you don't believe me.
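
To make the address-cycle mechanics concrete, here is a minimal, purely hypothetical descriptor layout (not taken from any real NIC): the driver hands the device a 64-bit bus address as two 32-bit halves, and the high half is what the extra DAC address cycle carries on the wire.

    #include <stdint.h>

    // Hypothetical DMA descriptor for a 64-bit-capable device.  The PCI bus
    // carries 32 address bits per cycle, so a 64-bit address is expressed as
    // two 32-bit words; the high word is what the dual address cycle (DAC)
    // adds on the bus.
    struct tx_descriptor {
        uint32_t addr_lo;   // low 32 bits of the bus address
        uint32_t addr_hi;   // high 32 bits, carried by the second address cycle
        uint32_t length;    // bytes to transfer
        uint32_t flags;     // device-specific control bits
    };

    static inline void desc_set_addr(struct tx_descriptor *d, uint64_t bus_addr)
    {
        d->addr_lo = (uint32_t)(bus_addr & 0xffffffffu);
        d->addr_hi = (uint32_t)(bus_addr >> 32);
        // Many devices fall back to a single address cycle when addr_hi is
        // zero, which is part of why the extra cycle rarely dominates.
    }
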
>> This is important to OS-bypass HPC network devices because it allows a large amount of memory to be pinned and available for network communication. On G5s (and IBM JS20s), by contrast, only 2 GB of memory can be pinned at any one time. Allowing for the needs of other devices, this realistically means that only about 1.5 GB can be pinned at once. This is one of the handicaps the G5 platform faces in the HPC cluster market.
> I wouldn't consider the G5's performance in HPC clusters such as Virginia Tech's to exactly be "handicapped." Every system has trade-offs between theoretical
Take a look at the efficiency (theoretical peak vs. achieved max) and you see the handicap. DAC-based machines can keep nearly all of RAM pinned, thereby avoiding any interaction with the host OS kernel during network communication; all the CPU cycles go to the application. IOMMU-based machines, on the other hand, either have to copy to/from a pinned region, or they keep pinning and unpinning memory whenever they need to transfer arrays whose size exceeds the amount of pinnable memory on the machine. Either way, CPU cycles are wasted on copying or on pinning/unpinning, both of which increase latency and overhead. This reduces the efficiency of the cluster.

Now, with all my whining and complaining aside, the IOMMU code in Tiger has gotten *much* better. With Tiger, pinning memory is nearly as fast as it is on a DAC machine. However, the address space is still limited, and there's nothing they can do about that.

Drew
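
For context on what "pinning" looks like from a Tiger-era driver, a minimal sketch follows. It assumes a user buffer address/length already supplied by the application; the function name pinUserBuffer is a placeholder, and the exact IODMACommand signatures should be checked against the 10.4 headers. prepare() wires the pages, and on a DART system it is also where I/O-space mappings get consumed, which is where the ~2 GB limit shows up; IODMACommand is the Tiger facility that lets a DAC-capable device ask for 64-bit device addresses.

    #include <IOKit/IOMemoryDescriptor.h>
    #include <IOKit/IODMACommand.h>

    // Sketch only: wire (pin) a user buffer, then describe it to a
    // 64-bit-capable device.  Error handling is trimmed for brevity.
    static IOReturn pinUserBuffer(mach_vm_address_t uaddr,
                                  mach_vm_size_t   len,
                                  task_t           task)
    {
        // Describe the user range and wire it down.  On a DART machine the
        // prepared range also occupies DART I/O space, so the total that can
        // stay prepared at once is bounded by the DART window (~2 GB).
        IOMemoryDescriptor *md = IOMemoryDescriptor::withAddressRange(
            uaddr, len, kIODirectionInOut, task);
        if (!md)
            return kIOReturnNoMemory;

        IOReturn ret = md->prepare();          // pin the pages
        if (ret != kIOReturnSuccess) {
            md->release();
            return ret;
        }

        // Tiger's IODMACommand lets a DAC-capable device ask for 64-bit
        // addresses (numAddressBits = 64); kMapped routes through the DART,
        // and a bypass option exists where the platform permits it.
        IODMACommand *cmd = IODMACommand::withSpecification(
            IODMACommand::OutputHost64,        // segment output format
            64,                                // device can address 64 bits
            0,                                 // no segment-size limit
            IODMACommand::kMapped);
        if (cmd) {
            cmd->setMemoryDescriptor(md);      // generate device-visible addresses
            // ... hand the segments to the hardware here ...
            cmd->clearMemoryDescriptor();
            cmd->release();
        }

        md->complete();                        // unpin when the transfer is done
        md->release();
        return kIOReturnSuccess;
    }

On a DAC machine the same prepare() can be left in place for essentially all of RAM at once, which is the efficiency argument above expressed in code form.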