Re: Strange behaviour of mmap() in OS X



[macos-x-dev removed from x-post list, edited for length]

[Jim]
[ I wrote]:
Note that other UNIXes I tested (Linux, IRIX) seem to handle mmap() more gracefully and the system remains fully usable during and after the mmap(). The amount of swapping is minimal.

There shouldn't really be any swap-out at all.

Any pageout algorithm that automatically punishes mapped file pages (simply because they don't have to be cleaned before recycling in many cases) is inherently unfair.

You seem to be making the assumption here that "fairness" is an inherent good. I feel no inherent moral imperative for treating memory pages fairly. I do see an inherent *practical* requirement for a VM system to be stable and predictable without catastrophic failures.

This looks to me like chasing some very theoretical goal and neglecting practical matters, but I could be mistaken.

[mapped pages are already 'punished']

In theory, maybe. In practice, the system will push everything else out of RAM onto disk in favor of the mapped file.

So, without explicit intervention/hinting, I wholeheartedly disagree with your assertion about "there shouldn't really be any swapping."

While I wholeheartedly disagree with your disagreement, I will settle for a compromise: not going into thrash-death would be a sufficient start ;-)

If the next likely page to replace needs to be swapped in order to bring in the next page needed from a mapped file, we should do the swapping. Of course, if there is hinting...we should be able to do a better job of choosing a page to replace (see below).

That is a microscopic view of the situation. The macroscopic view is that you need to have a policy that avoids such catastrophic failures, just like other operating systems.

[..]
The BSD madvise() API is mostly forward-looking (and all future effects on the memory as a result of the call are optional/advisory only). So, even if it does nothing, it's not, technically, broken.

At the time I filed the bug, the APIs were, in fact, also technically broken, because they would cause kernel panics / lockups in the types of high-traffic situations I was forcing. But let's not split hairs over this: the macroscopic behavior is broken, as can be easily observed.

And we do more than nothing when these are called.

Maybe now. At that time, the calls were simply not hooked up. (see Radar 2757124, madvise( MADV_DONTNEED, MADV_WILLNEED) not hooked up -- which is still open, by the way)
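For anyone following along, here is a minimal sketch of the kind of hinting under discussion. It assumes a POSIX system that provides madvise() and MADV_SEQUENTIAL; `map_sequential` is a hypothetical helper made up for illustration, not code from the radar. Since the hint is advisory, the sketch deliberately ignores its return value.

```c
#define _DEFAULT_SOURCE   /* glibc: expose madvise() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only and hint that it will be
 * read sequentially. Returns the mapping, or MAP_FAILED on error, and stores
 * the mapped length in *len_out on success. */
void *map_sequential(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return MAP_FAILED;
    }

    size_t len = (size_t)st.st_size;
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return MAP_FAILED;

    /* Advisory only: the kernel is free to ignore this, so a failure here
     * is not treated as an error. */
    (void)madvise(p, len, MADV_SEQUENTIAL);

    *len_out = len;
    return p;
}
```

Whether the hint changes anything is exactly the point in dispute; the call itself cannot tell you.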

It's just that we may not be as aggressive as some other vendors. But we had some reason to our madness:

Whatever the reasons, they lead to catastrophic results and should therefore be reviewed...

Specifically, in your case, marking something as "sequential" still doesn't give us quite enough information to avoid doing "really stupid" things in some cases.

Sorry, but even the "really stupid" things you might do are better than the *really bad* things the system does currently. From radar 2757117:

-- Expected Results:

Performance should degrade smoothly as real memory is exceeded, down to the performance of the disk. If the working set stabilizes, performance should pick up again. If the new memory is allocated/used at a constant rate but older memory is unused (working-set stays roughly constant), performance should be roughly the same as disk speed.


-- Actual Results:

Performance is good as long as real memory isn't exceeded, but breaks down catastrophically once real memory is exceeded. The paging rate is 1MB/s or lower where the disk is capable of 7MB/s or more (see workaround below). Overall performance degrades significantly because the system apparently makes less than optimal decisions about what to swap out.


That is, how far back from the current "fault" should we start deactivating pages? If it's too far back, we still spill over the available memory and at least have to swap some.

(a) better than the current situation, which is swapping out all of RAM
(b) better than forcing application programmers to make this decision

If it's not far enough back, the application could re-access the page right after we deactivated it.

Not optimal, but vastly superior to the current situation, which is catastrophic failure.
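To make the "how far back" trade-off concrete, here is a small sketch. It assumes a single linear pass through a mapping and a fixed trailing window; `struct range`, `trailing_range`, and the `window` parameter are hypothetical names invented for this illustration, not anything the kernel exposes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical range type: byte offsets [start, end) within a mapping. */
struct range { size_t start, end; };

/* Given the offset of the most recent fault, return the page-aligned range
 * lying more than `window` bytes behind it. An empty range (start == end)
 * means nothing is far enough back yet. page_size must be a power of two. */
struct range trailing_range(size_t fault_off, size_t window, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    struct range r = { 0, 0 };
    if (fault_off > window) {
        /* Round down so a page that is only partly behind the window
         * stays resident. */
        r.end = (fault_off - window) & ~(page_size - 1);
    }
    return r;
}
```

A caller could hand the resulting range to something like madvise(..., MADV_DONTNEED). A large window wastes some memory; a small one risks re-faulting a page that was just released. Either error is bounded, which is the point.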

That's because while pre-fetching in the sequential case, we rarely are working in the VM system with the same page the thread is currently working on. We could add some artificial overhead

Huh? You are doing the pre-fetching, so you *should* know how much you are pre-fetching.

to cause faults and detect where the thread "is". But that's assuming a straight [single] linear progression through the pages. With vector code, etc, you often make several passes through small[ish] data chunks and then move on. So, "sequential" isn't all that meaningful these days.

Huh? The assumption you seem to be making here is that with sequential, we need to know exactly when a page is 'done' so we can then immediately discard it. Why? All that is necessary is to know that when memory is tight, you start discarding sequential pages from the start of the sequential region. No clairvoyance required!

However, you have placed precisely this clairvoyance-burden on apps, because they do not get notified when memory needs to be freed, and there is not even a way to reliably specify "take that memory first in this order". Effectively, the application program has to implement its own VM system.
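As a toy model of the policy I mean (discard from the start of the sequential region when memory is tight), consider the sketch below. It is not how any real kernel tracks pages; `struct page` and `pick_victim` are hypothetical names, and the flat array stands in for whatever the pageout code actually scans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page record for a toy resident set. */
struct page {
    size_t index;       /* page number within its region */
    bool   sequential;  /* region was hinted as sequential */
    bool   resident;    /* page currently occupies real memory */
};

/* Pick a victim: the lowest-indexed resident page of a sequential region.
 * Returns its position in `pages`, or -1 if there is no such page and the
 * caller must fall back to its normal replacement policy. */
long pick_victim(const struct page *pages, size_t n)
{
    assert(pages != NULL || n == 0);

    long best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!pages[i].resident || !pages[i].sequential)
            continue;
        if (best < 0 || pages[i].index < pages[(size_t)best].index)
            best = (long)i;
    }
    return best;
}
```

Note that nothing here needs to know where the thread currently "is": the choice is made only when a page has to go, and it only needs the region's hint and the page order.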

Instead, we decided to rely on specific "past-looking" feedback. The msync(MS_INVALIDATE) call can be used to inform us which pages you absolutely don't need anymore. It's very similar to madvise(MADV_DONTNEED).

Sorry, that is just manual management of memory. From the radar report:

-- Workaround:

Using an mmap()ed file as backing store, with the MAP_SHARED option and added calls to vm_msync() with the SYNCHRONIZE option, (a) yields consistent performance of 7MB/s on a system where that is the disk speed and (b) doesn't impact the rest of the system.

However, that method kills the benefits one should have from the VM system, because operations that could operate all in memory are now also limited to disk speed.
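A rough sketch of the shape of this workaround follows. It substitutes the portable POSIX msync(MS_SYNC) for the Mach vm_msync() call used in the actual report, so treat it as an approximation; `fill_with_sync`, the chunk size, and the fill byte are hypothetical and chosen only for illustration.

```c
#define _DEFAULT_SOURCE   /* glibc: expose ftruncate() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fill `total` bytes of a file-backed MAP_SHARED mapping
 * in `chunk`-sized steps, flushing each chunk synchronously before moving on.
 * `chunk` must be a multiple of the page size and must divide `total`.
 * Returns 0 on success, -1 on error. */
int fill_with_sync(const char *path, size_t total, size_t chunk)
{
    assert(path != NULL && total > 0 && chunk > 0 && total % chunk == 0);
    assert(chunk % (size_t)sysconf(_SC_PAGESIZE) == 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)total) != 0) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    int rc = 0;
    for (size_t off = 0; off < total && rc == 0; off += chunk) {
        memset(p + off, 0xAB, chunk);          /* stand-in for real work */
        if (msync(p + off, chunk, MS_SYNC) != 0)
            rc = -1;                           /* blocks until written out */
    }

    munmap(p, total);
    return rc;
}
```

Every chunk waits for the disk, which is why throughput is pinned at disk speed even when the data would have fit in memory.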

-- Isolation:

Various options for advising the VM subsystem, for example with vm_behavior_set() or madvise(), have no discernible effect. Trying to achieve a similar, but softer, effect than SYNCHRONIZE with DEACTIVATE causes a hang (see other submitted bug).

But it has much more predictable scheduling behavior in the face of actually finding dirty pages in the range. That is: whose thread is used to do the cleaning to make the pages available? Many assume the madvise() call will return somewhat immediately, and the cleaning of any pages will happen asynchronously.

Since the call is called madvise(), and not m_force_purge(), that is a pretty good assumption to make.

But that asynchronous behavior can affect the rest of your media scheduling in unpredictable ways.

I am not doing any media-scheduling, and it seems to me that if media-scheduling has special requirements, then it should get special APIs/flags, instead of subverting the rest of the system to a point where it breaks.

Marcel

--
Marcel Weiher Metaobject Software Technologies
email@hidden www.metaobject.com
Metaprogramming for the Graphic Arts. HOM, IDEAs, MetaAd etc.
_______________________________________________
darwin-development mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/darwin-development
Do not post admin requests to the list. They will be ignored.

References: 
 >Re: Strange behaviour of mmap() in OS X (From: Jim Magee <email@hidden>)


