Delivered-To: darwin-dev@lists.apple.com
-- Terry

On Mar 9, 2008, at 10:31 AM, Jonas Maebe wrote:
> On 28 Jan 2008, at 04:23, Jordan K. Hubbard wrote:
>> On Jan 27, 2008, at 2:46 PM, Jonas Maebe wrote:
>>> That's more or less true for Linux, but not for Mac OS X at least up till 10.4.x (I haven't benchmarked on 10.5 yet). Compiling our compiler with itself, which involves about 173 (v)fork+execs from a single compiler run to assemble&link all the files, is 20% to 25% slower with fork instead of vfork on a G4, and 35% to 40% on a G5 (32 bit processes in both cases) on 10.4.x. And for clarity: this is relative to the entire time needed for compiling+assembling+linking everything (on the G5: 24 vs 15 seconds), not some academic mbench-like speed difference between fork and vfork.
>>
>> It would be interesting to benchmark this in 10.5 as well, given a number of changes to the relevant code
>
> I finally got around to benchmarking this again. All tests below are under 10.5.2, compiling our compiler with itself. In all cases these are "native" compilations (i.e., an i386 compiler compiling an i386 compiler, an x86_64 compiler compiling an x86_64 compiler etc), and the assembler gets its input via a pipe. When the compiler is told not to assemble/link, it generates a shell script with the necessary calls to the assembler and linker to assemble/link everything. The time needed to complete this script is what is timed in the second item for each case below.

fork(2) is expected to be rather slow, with its cost going up with the complexity of the application's address map.
Typical Mac OS X processes have a lot more address space mappings allocated per process, both in frame buffer/video data and in system libraries within the shared segment, and these end up getting duplicated and filled out if you fork(2), then thrown away when you execve(2).

So the cost you are measuring is the cost of duplicating the parent's address space mappings in the child process, then throwing those mappings away in the child when you replace the currently executing image with a new image via execve(2).

This is basically a conscious trade-off: runtime is made less expensive at the price of a more expensive duplication of address space mappings in fork(2). It means that the degenerate case of fork(2) immediately followed by execve(2) ends up slower than on other systems.

The vfork(2) call was explicitly added to COW (copy-on-write) implementations of process address space overcommit for exactly this reason, way back when (3.0 BSD): address space setup and teardown is expensive, and if you do it for no reason, it's going to show in your benchmarks, if that is what they measure (as opposed to what you may think they measure).

Typical Mac OS X applications will spend most of their time in user space in CPU-intensive code, or they will spend most of their time in the kernel, blocked on an I/O channel waiting for slow disks or other hardware to answer their request. Either way, at that point this overhead is very much lost in the noise, and so is not worth optimizing compared to other, lower-hanging fruit.

Unless you are modifying process state between the vfork(2) and the execve(2), vfork(2) is likely your best bet for a quick fix. If you are resetting privileges, opening/closing files, etc., then posix_spawn(2) is likely your best bet.
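
To make the "quick fix" concrete, here is a minimal sketch of the vfork(2) + execve(2) pattern in C. The helper name run_with_vfork and its return convention are hypothetical, made up for illustration; they are not from the compiler being discussed. The child does nothing between vfork(2) and execve(2) except fall back to _exit(2) if the exec fails.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper (name and return convention are assumptions):
 * run `path` with `argv`, wait for it, and return its exit status.
 * Returns -1 if the child could not be created or did not exit normally.
 */
int run_with_vfork(const char *path, char *const argv[])
{
    assert(path != NULL && argv != NULL);

    pid_t pid = vfork();
    if (pid < 0)
        return -1;                  /* vfork itself failed */

    if (pid == 0) {
        /* Child: no address space duplication has happened, so the only
         * calls made here are execve(2) and, on failure, _exit(2). */
        execve(path, argv, environ);
        _exit(127);                 /* exec failed; 127 by shell convention */
    }

    /* Parent: resumes once the child has exec'ed or exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_with_vfork("/bin/sh", args) with args set to {"sh", "-c", "exit 3", NULL} would be expected to return 3. Anything more elaborate in the child (closing descriptors, changing IDs) is outside what this pattern should be used for.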
Technically, the system is permitted to give "undefined behaviour" if you make any system call other than _exit(2) or execve(2) after calling vfork(2), so don't do that. If you want your code to keep working in future releases, use posix_spawn(2) instead.
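
For the case where the child needs files opened or closed before the new image runs, a posix_spawn(2) call with file actions might look roughly like the sketch below. The helper name spawn_with_stdout_to and its return convention are hypothetical; the posix_spawn_file_actions_* calls are the standard POSIX ones. Resetting privileges would go through a posix_spawnattr_t (for example the POSIX_SPAWN_RESETIDS flag), which is not shown here.

```c
#include <assert.h>
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper (name and return convention are assumptions):
 * run `path` with `argv`, with the child's stdout redirected to
 * `out_path`. Returns the child's exit status, or -1 on failure.
 */
int spawn_with_stdout_to(const char *path, char *const argv[],
                         const char *out_path)
{
    assert(path != NULL && argv != NULL && out_path != NULL);

    posix_spawn_file_actions_t actions;
    if (posix_spawn_file_actions_init(&actions) != 0)
        return -1;

    /* Ask the system to open out_path as fd 1 in the child, instead of
     * doing open/dup2 by hand between vfork(2) and execve(2). */
    int rc = posix_spawn_file_actions_addopen(&actions, STDOUT_FILENO,
                                              out_path,
                                              O_WRONLY | O_CREAT | O_TRUNC,
                                              0644);
    pid_t pid = -1;
    if (rc == 0)
        rc = posix_spawn(&pid, path, &actions, NULL, argv, environ);

    posix_spawn_file_actions_destroy(&actions);
    if (rc != 0)
        return -1;                  /* rc holds an errno-style code */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The point of the design is that all per-child setup is described up front as data (the file actions), so no user code runs in the child between process creation and the exec.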