Re: 64-bit problem with VM statistics
- Subject: Re: 64-bit problem with VM statistics
- From: Andrew Gallatin <email@hidden>
- Date: Mon, 6 Jun 2005 08:50:15 -0400 (EDT)
William Kucharski writes:
> On Jun 5, 2005, at 9:26 AM, Andrew Gallatin wrote:
>
> > This is not really true. Many PCI devices these days are capable of
> > addressing 64-bits of physical memory (eg, "DAC"). The DART enables
> > address limited devices to address memory anywhere, at the price of
> > artificially crippling DAC capable devices because of an oversight
> > on IBM or Apple's part.
>
> Though due to the PCI bus' 32-bit address width, DAC needs two cycles
> to specify the 64-bit address to be referenced, so depending upon
> locality it's not necessarily any faster for a DAC device to address
> memory than it is to do it via the DART or an IOMMU. One can easily
> craft scenarios in which each method of access is faster.
The time for an address cycle is of no consequence. I can show you
DMA benchmarks comparing Alpha, IA-64, and Opteron (all DAC) vs. the G5
(DART) if you don't believe me.
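(For anyone unfamiliar with DAC: the device just drives the 64-bit
address as two 32-bit halves, one per PCI address phase, instead of
going through a remapped 32-bit window. A rough, purely illustrative
C sketch of the split; no particular bridge or driver API is implied:)

    #include <stdint.h>
    #include <stdio.h>

    /* Purely illustrative: the two 32-bit halves a DAC-capable device
     * drives in its two PCI address phases.  A DART/IOMMU instead hands
     * the device a remapped 32-bit I/O address for the same page. */
    struct dac_address {
        uint32_t hi;    /* upper half, driven in the dual address cycle */
        uint32_t lo;    /* lower half, driven in the normal address cycle */
    };

    static struct dac_address dac_split(uint64_t busaddr)
    {
        struct dac_address a;
        a.hi = (uint32_t)(busaddr >> 32);
        a.lo = (uint32_t)(busaddr & 0xffffffffULL);
        return a;
    }

    int main(void)
    {
        struct dac_address a = dac_split(0x0000000240001000ULL);
        printf("hi=0x%08x lo=0x%08x\n", a.hi, a.lo);
        return 0;
    }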
> > This is important to OS-bypass HPC network devices because it allows
> > a large amount of memory to be pinned and available for network
> > communication. Whereas on G5s (and IBM js20s), only 2GB of memory can
> > be pinned at any one time. Allowing for the needs of other devices,
> > this realistically means that only 1.5GB can be pinned at any one
> > time.
> > This is one of the handicaps the G5 platform faces in the HPC
> > cluster market.
>
> I wouldn't consider the G5's performance in HPC clusters such as
> Virginia Tech's to exactly be "handicapped." Every system has
> trade-offs between theoretical
Take a look at the efficiency (achieved vs. theoretical peak), and you
see the handicap. DAC-based machines can keep nearly all of RAM pinned,
thereby avoiding any interaction with the host OS kernel during network
communication; i.e., all the CPU cycles are spent on the application.
IOMMU-based machines, however, either have to copy to/from a pinned
region, or they keep pinning and unpinning memory when they need to
transfer arrays whose size exceeds the amount of pinnable memory on
the machine. Either way, CPU cycles are wasted on copying or on
pinning/unpinning, both of which increase latency and overhead. This
reduces the efficiency of the cluster.
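To make those two fallbacks concrete, here's a rough userspace sketch.
mlock()/munlock() stand in for the pinning a real NIC driver does, and
send_pinned() is a hypothetical placeholder for the OS-bypass send
itself; neither is any actual NIC API.

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Hypothetical placeholder for an OS-bypass send of pinned memory. */
    static void send_pinned(const void *buf, size_t len)
    {
        (void)buf; (void)len;      /* a real NIC library call would go here */
    }

    /* Fallback 1: copy through a small, permanently pinned bounce buffer. */
    static void send_via_bounce(const char *src, size_t len,
                                char *bounce, size_t bounce_len)
    {
        size_t off;
        for (off = 0; off < len; off += bounce_len) {
            size_t chunk = (len - off < bounce_len) ? len - off : bounce_len;
            memcpy(bounce, src + off, chunk);   /* CPU cycles spent copying */
            send_pinned(bounce, chunk);
        }
    }

    /* Fallback 2: pin and unpin the caller's buffer around every transfer. */
    static int send_via_pin_unpin(const char *src, size_t len)
    {
        if (mlock(src, len) != 0)               /* CPU cycles spent pinning */
            return -1;
        send_pinned(src, len);
        munlock(src, len);                      /* ...and unpinning */
        return 0;
    }

    int main(void)
    {
        static char bounce[64 * 1024];
        size_t len = 8UL * 1024 * 1024;
        char *data = malloc(len);

        if (data == NULL)
            return 1;
        memset(data, 0xab, len);
        send_via_bounce(data, len, bounce, sizeof bounce);
        if (send_via_pin_unpin(data, len) != 0)
            return 1;                    /* e.g. RLIMIT_MEMLOCK too low */
        free(data);
        return 0;
    }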
Now, with all my whining and complaining aside, the IOMMU code in
Tiger has gotten *much* better. With Tiger, pinning memory is
nearly as fast as it is on a DAC machine. However, the I/O address
space is still limited; there's nothing they can do about that.
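A crude way to get a feel for that pinning cost from userspace is to
time mlock()/munlock() on a big, already-faulted buffer. It is only a
proxy for the in-kernel wiring a driver does, and it says nothing about
the DART window itself, but it shows where the cycles go:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    /* Rough proxy: time wiring (mlock) and unwiring (munlock) a buffer.
     * Subject to RLIMIT_MEMLOCK; shrink BUF_SIZE if mlock() fails. */
    #define BUF_SIZE (64UL * 1024 * 1024)

    static double now_sec(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        char *buf = malloc(BUF_SIZE);
        double t0, t1, t2;

        if (buf == NULL)
            return 1;
        memset(buf, 0, BUF_SIZE);       /* fault the pages in first */

        t0 = now_sec();
        if (mlock(buf, BUF_SIZE) != 0) {
            perror("mlock");
            return 1;
        }
        t1 = now_sec();
        munlock(buf, BUF_SIZE);
        t2 = now_sec();

        printf("pin:   %.3f ms\n", (t1 - t0) * 1e3);
        printf("unpin: %.3f ms\n", (t2 - t1) * 1e3);
        free(buf);
        return 0;
    }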
Drew