Re: what is expected malloc behavior re. limits/insufficient resources?
- Subject: Re: what is expected malloc behavior re. limits/insufficient resources?
- From: Esteban Bodigami <email@hidden>
- Date: Mon, 20 Jul 2009 05:43:07 -0600
ok... what radars are you speaking of? and Lain is both an anime and an OS... this is getting too beatnik ^_^_^
2009/7/20 IainS
<email@hidden>
I have no idea what's going on here - no action of mine should have precipitated this...
I filed the radars, they've been acknowledged... and I'm awaiting the outcome.
Iain
On 20 Jul 2009, at 04:14, Esteban Bodigami wrote:
that's a thread from '08!
2009/7/19 Terry Lambert
<email@hidden>
Why is this addressed as a response to me?
-- Terry
On Dec 25, 2008, at 3:00 AM, IainS <email@hidden> wrote:
Thanks, Terry, for your helpful response.
Merry Xmas all round... but just before we head off to over-indulge, I would welcome your comments on a couple more points.
On 25 Dec 2008, at 09:03, Terry Lambert wrote:
On Dec 24, 2008, at 3:05 PM, IainS <developer@sandoe-acoustics.co.uk> wrote:
On 24 Dec 2008, at 22:47, Terry Lambert wrote:
This can exceed total RAM plus swap because of memory overcommit, which has pretty much been SOP on unix since the DEC VAX version introduced virtual memory.
I believe that there are two distinct kinds of overcommit:
Let's define (for the sake of illustration) available = [ram + free swap - wired] ; overcommit as "anything > available".
Kind 1/ with multiple competing processes (**all of which would individually fit within 'available'**).
A suitable and successful strategy is to serialize the access to the resources by making processes sleep when a memory demand cannot be met. The assumption is that processes will eventually complete and, although the system might be sluggish owing to swapping, it remains accessible. Some form of prioritization allows at least one process to proceed - such that a deadlock doesn't occur. Everyone's happy.
Kind 2/ Where the system allows a single process to overcommit.
In this case, assuming that the process uses the allocated memory, (and without a grim reaper watchdog process) this will eventually result in a locked system (the amount of time will depend on the offending process' priority).
(a) the offending process cannot complete - it will always end up sleeping whilst waiting for some unavailable blocks (which can never be made available, because the disk is full).
(b) most likely on the way to this the majority of other processes will be swapped out - and cannot be swapped back in because there is no more backing available for the RAM.
(c) in the end all processes that rely on allocation of memory will be asleep - whilst the system has not "technically" crashed - it is actually in a vegetative state and we might as well switch off the life support...
(d) anything that needed space on the root partition (including all system and user databases in a default installation) has probably lost data.
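For concreteness, here is a minimal sketch (purely illustrative, and obviously not something to run on a machine you care about) of the kind of single-process runaway allocation I mean - the memset is what turns a virtual overcommit into real consumption of backing store:

    /* Illustrative only: allocate and dirty memory until the system runs
       out of backing store.  malloc() can keep returning virtual space long
       after RAM + swap are spoken for; touching every page is what forces
       real backing to be found. */
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const size_t chunk = 64UL * 1024 * 1024;     /* 64 MB per step */
        for (;;) {
            void *p = malloc(chunk);
            if (p == NULL)
                break;                /* may never be reached before the system wedges */
            memset(p, 0xA5, chunk);   /* dirty the pages so swap is really consumed */
        }
        return 0;
    }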
The problem with this second scenario is that you've just described the halting problem by putting forth the implied question of how such processes can be recognized. Will the process complete? We have no idea. Maybe a timer will fire and release resources. We can't know this merely because there is an outstanding unfired timer somewhere in the system, because there is no cause/effect coupling annotating the timer when it was enqueued. There are dozens of high probability events like our putative timer, and the number of actual events that might release resources goes up as the probability of them being coupled to implied resource reservations goes down.
Unfortunately, computing hasn't been about Dijkstra's algorithm and "buy a big enough machine" (one with sufficient resources for all contingencies) since at least 1978.
They mostly did it by having a 32-bit address space and twice as much precommitted swap as physical RAM.
Indeed, I remember this well as a rule-of-thumb. Although, actually, the problem does affect older OSX 32 bit machines with smaller disks - if the same criteria are met (insufficient backing).
Older OSX had this issue, as many modern OSs, including Windows, now do, because swap was not preallocated, and disk space for it was used as another form of overcommit: if I'm not going to use that much swap, why should I permanently lose 64M of disk space over it?
This is more or less market-driven, just like the changes in filesystem technology to eliminate the free reserve of 15% in UFS, which was there to avoid disk fragmentation by reducing the hash-fill to Knuth's 85% (cylinder group allocation in UFS was effectively a hash operation). People want to be able to "use what they paid for", without having to understand that they _were_ using the space -- just not for storage of their data.
So overcommit.
If that didn't work, then the approach which was commonly used was to kill programs in order to recover resources.
The problem we have here is that it's frequently the case that the system has become so sleep-bound that you cannot obtain an interface to do this - and in any event, there's no automatic way to recover, and a typical non-command-line user is not going to be able to succeed.
Not really. The system killed the process as a result of tripping into the fault handler in the pager. Kernel resources are (with rare exceptions) never overcommitted, and so the fault handler and kill code get to run. AIX did this. Later it added the ability (via signal registration) to mark some processes as "precious", which it would avoid killing if possible, to prevent what would otherwise turn into a reboot.
What you'd really like to do is kill the process that caused the shortage in the first place. But is that the ray tracing application taking up 95% of the resources that has been running two weeks, or is it the 6%-using web browser you fired up to surf news while waiting the last 20 minutes for the ray tracer to complete?
It's even possible that Activity Monitor would fail to quit the process since it might need memory to pop up the "force-quit, quit, cancel" dialogue.
We're not talking about a user-controlled kill, which has the problems you note, unless you take all the needed code between the mouse and the kill system call out of the overcommit realm. We are talking about something a lot more brutal and low level, guaranteed to keep the system viable. But maybe at the expense of two weeks of ray tracing due to be sent out for green screen edit in the next 3 days.
You can also opt a program out of overcommit several ways, but you typically have to replace the malloc, or not use it. Most of these boil down to forcing swap to be precommitted by dirtying the pages or locking them into physical memory. This is usually inconvenient for the programmer. Also, the other processes sharing the system have to do without those resources, even if they are not being actively used by the greedy precommit process.
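Something like the following is what "dirtying the pages or locking them into physical memory" means in practice - a sketch, not any particular allocator's actual mechanism; note that mlock() is itself subject to RLIMIT_MEMLOCK and can refuse large regions:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Allocate 'size' bytes and force the backing to exist up front, either
       by wiring the pages (mlock) or, failing that, by dirtying them. */
    static void *malloc_precommitted(size_t size)
    {
        void *p = malloc(size);
        if (p == NULL)
            return NULL;
        if (mlock(p, size) != 0)      /* wiring refused (RLIMIT_MEMLOCK, size, ...) */
            memset(p, 0, size);       /* fall back to touching every page */
        return p;
    }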
hm. I wouldn't ask to set aside resources for this - the basis of my thesis here is that there is no point allowing a process to (actually) overcommit (I accept that virtual overcommit is useful) -- somehow one has to draw the line when paging space runs out.
This assumes it's a measurably finite resource. What happens when you see the system headed towards this (or in the middle of it), and plug in another disk to help out? Your line just moved.
I suggest filing a problem report.
I'll do this and copy the radar number here
BTW: in the case that a kernel extension is stupid enough not to wire its memory ... this can also cause a panic (but since the panic is probably down to a 3rd party extension I would not expect it to receive much attention).
It pretty much has zero choice in the matter; the memory comes to it wired.
(a) Is malloc() supposed to honor ulimits?
(b) Is malloc() supposed to allow requests in excess of available system resources?
Is there any official answer to these?
This isn't an official support channel. To get an official answer, you'd need to file a problem report.
Here are my answers, though:
(a) No. Currently the limits are voluntary. For example, gcc uses getrlimit() to look at the limits and voluntarily limits its working set size by choosing not to exceed the limits.
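Roughly what "voluntary" looks like in practice - a sketch, not gcc's actual code; the byte accounting is the caller's own, since malloc() itself does none:

    #include <stdlib.h>
    #include <sys/resource.h>

    /* Voluntarily honor the RLIMIT_DATA soft limit: refuse any allocation
       that would push our own running total past it. */
    static size_t bytes_handed_out;

    static void *limited_malloc(size_t size)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_DATA, &rl) == 0 &&
            rl.rlim_cur != RLIM_INFINITY &&
            bytes_handed_out + size > rl.rlim_cur)
            return NULL;                      /* would exceed the soft limit */
        void *p = malloc(size);
        if (p != NULL)
            bytes_handed_out += size;
        return p;
    }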
Hard limits can be set by system admin (in my admin days 'normal' users had to request single-user access to machines if they wanted to do stuff like this).
Soft voluntary limits would actually solve my problems - so long as malloc honors them.
It's up to the caller of malloc() to honor them. The malloc() call itself is a user space library call that has no idea of how much total virtual address space is available (disk+RAM; a poor measure anyway, since you are competing with other processes, and disk can fill for other reasons). In addition, it uses Mach calls to allocate anonymous pageable memory, which means it would bypass any resource limits established in the BSD layer in any case (setrlimit).
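A quick sketch of seeing this for yourself - behavior as described above for Darwin's Mach-backed malloc; a traditional sbrk()-grown malloc would be expected to fail the large allocation instead:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Ask for a 64 MB data-segment limit... */
        struct rlimit rl = { 64UL * 1024 * 1024, 64UL * 1024 * 1024 };
        if (setrlimit(RLIMIT_DATA, &rl) != 0)
            perror("setrlimit");

        /* ...then ask malloc() for 1 GB.  Because the allocation is anonymous
           Mach VM rather than an sbrk()-grown data segment, the BSD-layer
           limit is never consulted. */
        void *p = malloc(1UL << 30);
        printf("malloc(1GB) under a 64MB RLIMIT_DATA: %s\n",
               p ? "succeeded" : "failed");
        free(p);
        return 0;
    }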
A possible strategy to mitigate this is to precommit swap space for all anonymous physical allocations and all shared COW allocations on a per-process basis, "just in case", but if we do that, we are in the boat of telling people to buy bigger disks for their expected workloads, and the previously mentioned sparse utilization algorithms are out in the cold, unless you potentially have terabytes of disk available.
It seems to me that it is easier to just fix the broken software causing the problem.
(b) Yes. Allocations that it makes are for virtual, not physical, resources. Virtual resources are effectively near-infinite, and you could be using an algorithm that has sparse utilization of the virtual resources. If so, then there's no reason your algorithm shouldn't be allowed to work, due to an arbitrary administrative limit enforced against a virtual resource.
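For example, a sketch of what sparse utilization means here (assuming a 64-bit process; the span size is arbitrary): reserve a large stretch of address space and touch only a tiny fraction of it - the virtual size says almost nothing about the real cost.

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Reserve 64 GB of virtual address space... */
        size_t span = 64ULL * 1024 * 1024 * 1024;
        char *p = mmap(NULL, span, PROT_READ | PROT_WRITE,
                       MAP_ANON | MAP_PRIVATE, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* ...but touch only one page per gigabyte.  Real memory and swap use
           stay tiny; a limit enforced against the virtual size alone would
           have forbidden this for no gain. */
        for (size_t off = 0; off < span; off += 1ULL << 30)
            p[off] = 1;

        munmap(p, span);
        return 0;
    }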
Your argument is good - and I agree it's challenging - but one ought to be able to find a way to "bail out" in the case that a process does not make sparse use of the resource. Perhaps an overcommitted process could be marked as "do bus error, instead of sleeping when backing runs out" (which is similar to one of your suggestions, too).
This conflicts with intentionally large utilization by an important process getting shot in the foot by an unimportant small utilization pushing it over the top (e.g. some fool firing up a browser on a machine in a render farm because they are too lazy to walk back to their office).
Typically these are one user or single purpose machines.
hm .. I run Mail, iTunes, and numerous other background processes which all suffer database corruption if the system blocks this way. OK, it's not a DEC20 timesharing system... but "single user" is stretching things a bit these days ;-)
Yes, but an application in that environment that eats 2^40 bits of virtual address space on you is either broken or not intended to be run with the rest of your workload at the same time.
Back in my admin days, it was also possible to do this on multiuser machines, and our typical reaction was to disable the offending users account. For a single user machine, you could simply reinstall the offending software and contact the vendor for a fix.
I can honestly say that this never happened, AFAIK, on the Vax, Primes and Suns we used (and I've worked in R&D all my life with people persistently trying wacky things :-); normal user accounts had restrictive limits. (I do not say it's impossible, just that I don't believe I ever saw it happen.)
We tended to do it on purpose, at least on role-based systems. The way you tune a traditional UNIX is to (1) remove all administrative limits, (2) load it up til it crashes, (3) set the limits just below that.
For a general purpose (shell account) machine, you (1) set ridiculously low limits, (2) played BOFH if anyone complained.
Neither one's a good option for a desktop machine.
If it cannot be regarded as a bug, perhaps honoring ulimits or providing RLIMIT_VMEM would be a very useful enhancement?
You can file a problem report, but we are unlikely to add an RLIMIT_VMEM for the same reason top can't give you an exact answer on who is responsible for shared pages: cardinality hides this information, even from the OS itself.
Perhaps, then, the desideratum is unachievable (preventing a single process from over-committing).
Limiting your process's utilization of the available (by virtue of addressable bits) virtual address space with RLIMIT_DATA is much more likely, but that would be either voluntary or up to its parent process to set before the program is started.
As I said above, admin can set hard limits for RLIMIT_XXXX (I know there's no absolute guarantee, but one can make things considerably more robust). If you choose to raise the limits for a single process or user, at least you then enter the territory with eyes wide open (and with Mail and iTunes and anything else with a db shut down...)
Admins do this by controlling the limits for the processes after fork and before de-escalating privileges to do the exec. You'd likely be unhappy with the results in a GUI environment, for a lot of reasons. Gamers would outright hate it.
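Roughly the traditional pattern, sketched (the uid and the program path are placeholders, and error handling is elided):

    #include <sys/resource.h>
    #include <unistd.h>

    /* Sketch: the privileged parent forks, clamps the child's limits,
       drops privileges, then execs the target program. */
    int spawn_limited(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            struct rlimit rl = { 256UL * 1024 * 1024, 256UL * 1024 * 1024 };
            setrlimit(RLIMIT_DATA, &rl);  /* soft and hard cap, set before exec */
            setuid(501);                  /* de-escalate privileges (placeholder uid) */
            execl("/usr/local/bin/someapp", "someapp", (char *)NULL);
            _exit(127);                   /* exec failed */
        }
        return (int)pid;
    }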
For a responsible user testing code which is intended to stress the system, voluntary limits are entirely satisfactory -- providing they are honored by malloc.
have a great Xmas,
Iain
Merry Christmas!
--
Esteban Giuseppe Bodigami Vincenzi