Re: kernel paging (was Hard-limits for kern.maxproc)
- Subject: Re: kernel paging (was Hard-limits for kern.maxproc)
- From: Terry Lambert <email@hidden>
- Date: Sat, 07 Feb 2009 14:15:05 -0800
On Feb 7, 2009, at 3:34 AM, Steve Checkoway <email@hidden> wrote:
On Feb 7, 2009, at 3:14 AM, Terry Lambert wrote:
On Feb 7, 2009, at 12:39 AM, Steve Checkoway <email@hidden> wrote:
Thank you Terry, Michael, and Michael. That's all very
interesting! Kernel design decisions are always fascinating.
Just to be clear, I understand that the kernel virtual address space is
the scarce resource. I was asking about the possibility of using a
mechanism similar to paging that actually reuses portions of the
virtual address space. Of course, demand paging wouldn't be possible
since the same (virtual) addresses would refer to different data,
depending on some context. Given the apparent complexity of normal
paging, my idea seems completely infeasible.
v7 UNIX, CP/M, MP/M, and DOS used an approach called overlays to do
exactly this. v8 and v9 UNIX also did this (they had to fit in the
16-bit address space of the PDP hardware, meaning 64K of code and 64K
of data, at best), but you are unlikely to find them anywhere.
To do this, you still have to break the dependency relationship between
overlays and the overlaid code: code loaded in one overlay can't depend
on code in another overlay that shares the same overlay region.
Interestingly enough, I'm looking at an older 16-bit embedded system
that takes this approach. By carefully maintaining an overlay call
stack (I don't have a better name for it) and a table of functions and
which overlay each is in, it's possible for one function to call a
function in a different overlay: it calls a helper function, always
mapped into the address space, that remaps the other region. In this
case, the code resides in several ROMs and the mapping can be changed
by a single instruction. Obviously it'd be more complicated to do in a
general purpose kernel.
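Roughly, that helper-function mechanism might look like the sketch
below in C. All of the names here (overlay_map, overlay_call, the table
contents) are invented for illustration; the real thing is
firmware-specific, and the bank switch would be a single instruction
rather than a stub.

#include <stdint.h>

#define MAX_OVERLAY_DEPTH 16

typedef void (*ov_fn_t)(void);

/* Hypothetical hardware hook: on the ROM system this is a single
 * bank-switch instruction; stubbed here so the sketch compiles. */
static uint8_t current_overlay;
static void overlay_map(uint8_t overlay) { current_overlay = overlay; }

/* One function that notionally lives in overlay 2. */
static void demo_fn(void) { }

/* Table of callable functions and the overlay each one lives in. */
struct ov_entry { ov_fn_t fn; uint8_t overlay; };
static const struct ov_entry ov_table[] = {
    { demo_fn, 2 },
};

/* The "overlay call stack": which overlay to restore on return. */
static uint8_t ov_stack[MAX_OVERLAY_DEPTH];
static int     ov_sp;

/* Always-mapped helper: save the caller's overlay, map the callee's,
 * call through the table, then restore the caller's overlay. */
void overlay_call(unsigned idx)
{
    ov_stack[ov_sp++] = current_overlay;
    overlay_map(ov_table[idx].overlay);

    ov_table[idx].fn();

    overlay_map(ov_stack[--ov_sp]);
}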
Call graphs are by definition directed graphs. DCGs are not severable,
while DAGs are. It is computationally hard to detect what are called
Hamiltonian cycles - A calls B calls C calls ... calls B - but it's
possible using what's called Warshall's algorithm, and if you are
clever, it's possible to do this incrementally.
So it's not actually that complicated to do, but it does need planning.
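As a rough sketch of the Warshall idea (toy code, nothing from any real
kernel): compute the transitive closure of the direct-call matrix, and
any function that can reach itself is sitting on a call cycle.

/* Toy cycle detection via Warshall's transitive closure.
 * reach[i][j] starts as "i calls j directly"; after the triple loop it
 * means "i can reach j through some chain of calls". */
#include <stdio.h>

#define N 4   /* number of functions in this toy call graph */

int main(void)
{
    /* Toy call graph: A->B, B->C, C->B, so B and C are on a cycle. */
    int reach[N][N] = {
        {0, 1, 0, 0},
        {0, 0, 1, 0},
        {0, 1, 0, 0},
        {0, 0, 0, 0},
    };

    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = 1;

    for (int i = 0; i < N; i++)
        if (reach[i][i])
            printf("function %d is on a call cycle\n", i);
    return 0;
}

The incremental variant amounts to updating the closure as call edges
are added, rather than recomputing the whole matrix each time.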
It is possible to do this for third-party drivers using a technique
called static analysis. This is relatively straightforward for
fixed-size, aligned instructions -- call them PPC and ARM -- and a lot
harder for variable-sized instruction set processors. You effectively
look at the code and decide the subtrees of code that will be used for
a given set of code: the functions called by the functions called by
the ..., even through calls through function pointers or vtables.
Certainly this is not decidable in general, by a straightforward
reduction from the halting problem. When is the analysis happening?
When the driver is loaded into the kernel?
It's not analogous to the halting problem. It just becomes 4! harder
when you have to guess the instruction boundaries on a 32-bit
processor, and you end up with false positives when you guess wrong.
It comes down to determining code reachability.
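A toy sketch of that reachability computation (all names invented, and
the direct-call table would have to come from the static analysis
described above, including an over-approximation of calls through
function pointers and vtables):

/* Mark every function transitively reachable from a driver entry point,
 * given a precomputed direct-call edge matrix. Illustration only. */
#include <stdio.h>

#define NFUNCS 6

/* calls[i][j] != 0 means function i contains a direct call to j. */
static const int calls[NFUNCS][NFUNCS] = {
    {0, 1, 1, 0, 0, 0},
    {0, 0, 0, 1, 0, 0},
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 1},   /* 4 and 5 are never reached from function 0 */
    {0, 0, 0, 0, 0, 0},
};

int main(void)
{
    int reachable[NFUNCS] = {0};
    int stack[NFUNCS], sp = 0;

    stack[sp++] = 0;          /* entry point: function 0 */
    reachable[0] = 1;

    while (sp > 0) {          /* depth-first walk of the call edges */
        int f = stack[--sp];
        for (int g = 0; g < NFUNCS; g++)
            if (calls[f][g] && !reachable[g]) {
                reachable[g] = 1;
                stack[sp++] = g;
            }
    }

    for (int f = 0; f < NFUNCS; f++)
        printf("function %d: %s\n", f,
               reachable[f] ? "needed" : "unreachable");
    return 0;
}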
Ideally, you do the work ahead of time so the calculation can be done
once. For dtrace on PPC, we actually analyze at runtime using some
very smart code (my office mate has a patent pending on the technique).
So it's doable, but the amount of effort involved would be more than,
for example, moving the graphics card push pages out of the kernel
address space, which would be an order of magnitude win over that
approach.
I'm not entirely sure what graphics card push pages are, but it seems
pretty clear that an overlay approach is complex, and one that seems
unnecessary given a larger kernel virtual address space.
Graphics cards like to do speculative work while there are cycles
available that aren't doing other things. For this to work, they need
the ability to force eviction of the results when they aren't going to
actually be needed, to make room for new actual or speculative work.
Straight LRU would evict stuff they actually need, since they are
calculating ahead into increasingly less probable futures.
The current implementation for most modern graphics drivers uses
kernel page mappings for this to avoid the overhead of protection
domain crossings (switching between kernel mode - ring 0 - and user
mode - ring 3 - is expensive due to current processor designs). So
they have to be able to push these mappings out when they become an
excluded future, and that means they live in kernel address space,
which makes that space unavailable for other things.
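As a purely illustrative sketch (invented names and fields, not actual
driver code), the eviction policy being described amounts to something
like: throw away the least probable speculative results first, and only
fall back to LRU when nothing speculative is left.

#include <stddef.h>

struct push_page {
    int      in_use;
    int      speculative;   /* result of work that may never be needed */
    unsigned probability;   /* driver's guess it will be needed, 0..100 */
    unsigned last_used;     /* LRU timestamp */
};

/* Pick a victim among n pages; returns an index, or -1 if none is in use. */
int pick_victim(const struct push_page *p, size_t n)
{
    int victim = -1;

    /* Pass 1: the least probable speculative page. */
    for (size_t i = 0; i < n; i++)
        if (p[i].in_use && p[i].speculative &&
            (victim < 0 || p[i].probability < p[victim].probability))
            victim = (int)i;
    if (victim >= 0)
        return victim;

    /* Pass 2: plain LRU over what's left, all of which is actually needed. */
    for (size_t i = 0; i < n; i++)
        if (p[i].in_use &&
            (victim < 0 || p[i].last_used < p[victim].last_used))
            victim = (int)i;
    return victim;
}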
There are tricks you can do in terms of not doing the work for backing
store pages, but they are only useful on benchmarks. If a human clicks
a window forward and you avoided calculating the backing store contents
of that window because it was occluded and you thought you could get
away with it, then you have to eat a user-visible delay while you do
the calculation you could have done speculatively. So if you write to
the benchmarks, the numbers are good but the real world performance is
terrible.
This is one of the reasons why graphics nerds are obsessed with frame
rates past the 60 a second you can possibly ever display on an LCD
panel with a 120Hz refresh rate: they don't know *which* 60 of the
ones they've calculated they will need to display.
As always, thanks for the fascinating look into the computing past.
It's not ancient history, as much as we might wish it were. 8-).
Everything old is new again.
-- Terry