Re: 64-bit/shared memory performance
On Apr 9, 2008, at 12:22 AM, Jamil J. Weatherbee wrote:

> I have grown extremely frustrated with the shm_open()/mmap() interface.
> In order to work with a shared memory object greater than 2GB, I have
> had to "chain" together multiple segments of an object created with
> shm_open().

Sure. I even gave you the radar number on it, and told you why the limit
exists in the first place.

> Now, here is my dilemma. I am specifically trying to avoid a
> multithreaded client/server architecture, which I expect (in theory, at
> this point) to be a much lower-performance alternative to this.

This depends on the scheduling scope for the threads: whether the quantum
is awarded only on a process basis, or on a thread basis. When we used
this model (including descriptor and similar rfork()/sfork() context
sharing between processes) on the NetWare for UNIX implementation, it was
done that way because the SVR4.{0/0.2/1/2} scheduler implementations were
very poor. Note that this is only ever an issue if you are in an
environment where you are competing for quantum with processes that are
not associated with your implementation, and have not modified your
priority. So the idea that you need to go to multiple processes rather
than multiple threads in a single process is wrong-headed.

> 1) Create a LARGE contiguous shared memory area (i.e. 4GB+) which is
> intended to be entirely resident in the physical RAM of the machine.

You can do this.
You simply can't establish it all at once in a single segment, which is
not that big an issue, since what is taking the time is establishing the
page mappings in the address map, not the creation of the address map
itself.

> 2) Access the shared memory area read-only from many hundreds of client
> processes in a speedy fashion. "Speedy" meaning that at present I have
> to shm_open() and mmap() together segments of 1GB to produce one large
> 4GB object in each client process. This works fine, but as soon as I
> try to READ the object, the kernel appears to be doing some kind of
> "COPY ON READ": you can see in "top" that it takes about 6 seconds for
> RSIZE to grow to VSIZE. I am unclear on why this is happening, as I am
> no VM expert, especially since I am using MAP_FIXED in mmap() so that
> the addresses are the same in every process's virtual address space
> (this is actually what makes the chaining work). I can only imagine
> that the VM system is writing some kind of kernel mappings.

Here is where you are going off into the weeds. The thing that is taking
the time is mapping the page entries into the virtual address space of
the new process. You established a shared region by pointing at the same
region from multiple processes, but there is no explicit association of
the backing pages with the new object until you fault them by attempting
to access them from the new process.

> 3) What I need is the fastest, most overhead-free manner of directly
> accessing a segment of shared memory from multiple processes. These
> processes are spawned and die often and quickly; they are actually the
> result of .php scripts exec'ing a small compiled C program on the local
> machine. This would normally be the domain of an SQL server or the
> like, but this type of data access cannot be handled efficiently inside
> a standard RDBMS architecture.
> I need top performance, so the data query must be written in C and must
> have read access to the data in memory (the 4GB of shared memory is
> actually only an "index" to the real data, which exists in secondary
> storage). Any help/suggestions would be greatly appreciated.

I suggest that you start your new processes from a master process that
forks after having read-touched the pages, to establish the mappings in
the first place. Once your 6 seconds of read-faulting is done in the
master process, any process forked off that process will get a reference
on the map instead of a copy of the map, and will not need to fault
itself to establish duplicate mappings.

Alternately, as I previously suggested, you can use Mach primitives to
pass around address space references for the region between processes
(and avoid the mmap/shm interfaces altogether). This requires that you
learn about how Mach does things, however, and there's not a huge amount
of documentation on that, other than the source code, or Amit Singh's
(very large) book.
-- 
Terry Lambert