Re: 64-bit/shared memory performance
On Apr 9, 2008, at 12:22 AM, Jamil J. Weatherbee wrote:

> I have grown extremely frustrated with the shm_open()/mmap() interface.
> In order to work with a shared memory object greater than 2GB, I have
> had to "chain" together multiple segments of an object created with
> shm_open().

Sure. I even gave you the radar number on it, and told you why the limit
exists in the first place.

> Now, here is my dilemma. I am specifically trying to avoid a
> multithreaded client/server architecture, which I expect (in theory, at
> this point) to be a much lower-performance alternative to this.

This depends on the scheduling scope for the threads: whether the quantum
is awarded only on a process basis, or on a thread basis. When we used
this model (including descriptor and similar rfork()/sfork() context
sharing between processes) on the NetWare for UNIX implementation, it was
done that way because the SVR4.{0/0.2/1/2} scheduler implementations were
very poor. Note that this is only ever an issue if you are in an
environment where you are competing for quantum with processes that are
not associated with your implementation, and have not modified your
priority. So the idea that you need to go to multiple processes rather
than multiple threads in a single process is wrong-headed.

> 1) Create a LARGE contiguous shared memory area (i.e. 4GB+) which is
> intended to be entirely resident in the physical RAM of the machine.

You can do this.
You simply can't establish it all at once in a single segment, which is
not that big an issue, since what is taking the time is establishing the
page mappings in the address map, not the creation of the address map
itself.

> 2) Access the shared memory area read-only from many hundreds of client
> processes in a speedy fashion. "Speedy" meaning that at present I have
> to shm_open() and mmap() together segments of 1GB to produce one large
> 4GB object in each client process. This works fine, but as soon as I
> try to READ the object, the kernel appears to be doing some kind of
> "COPY ON READ": you can see in "top" that it takes about 6 seconds for
> RSIZE to grow to VSIZE. I am unclear on why this is happening, as I am
> no VM expert, especially since I am using MAP_FIXED in mmap() so that
> the addresses are the same in every process's virtual address space
> (this is actually what makes the chaining work). I can only imagine
> that the VM system is writing some kind of kernel mappings.

Here is where you are going off into the weeds. The thing that is taking
the time is mapping the page entries into the virtual address space of
the new process. You established a shared region by pointing at the same
region from multiple processes, but there is no explicit association of
the backing pages with the new object until you fault them by attempting
to access them from the new process.

> 3) What I need is the fastest, most overhead-free manner of directly
> accessing a segment of shared memory from multiple processes. These
> processes are spawned and die often and quickly; they are actually the
> result of .php scripts exec'ing a small compiled C program on the local
> machine. This would normally be the domain of an SQL server or the
> like, but this type of data access cannot be handled efficiently inside
> a standard RDBMS architecture.
> I need top performance, so the data query must be written in C and must
> have read access to the data in memory (the 4GB of shared memory is
> actually only an "index" to the real data, which exists in secondary
> storage). Any help/suggestions would be greatly appreciated.

I suggest that you start your new processes from a master process that
forks after having read-touched the pages, to establish the mappings in
the first place. Once your 6 seconds of read-faulting is done in the
master process, any process forked off that process will get a reference
on the map instead of a copy of the map, and will not need to fault
itself to establish duplicate mappings.

Alternately, as I previously suggested, you can use Mach primitives to
pass around address space references for the region between processes
(and avoid the mmap/shm interfaces altogether). This requires that you
learn about how Mach does things, however, and there's not a huge amount
of documentation on that, other than the source code, or Amit Singh's
(very large) book.
-- 
Terry Lambert