Re: thread_t, uthread_t, at al.?
- Subject: Re: thread_t, uthread_t, at al.?
- From: Terry Lambert <email@hidden>
- Date: Fri, 27 Oct 2006 16:10:41 -0700
On Oct 27, 2006, at 12:42 PM, Rick Mann wrote:
On Oct 27, 2006, at 12:21 , Michael Smith wrote:
There is a reason that lsof does this. The information you're
looking for is not maintained by the system; it would be expensive
to do so, and the work would almost always be wasted.
I'm not sure I agree with this assertion. I don't think it would be
asking too much. The Mac OS (Carbon) file system was able to do it,
and I hardly think that support could be blamed for any significant
slowdown. (Mind you, this is speculation).
Carbon did this in pre-Mac OS X days by running in the same address
space; if you don't do protection domain crossing for each piece of
information, it's a heck of a lot easier to do what you are trying to
do. If you have to cross protection domains (e.g. because you have an
OS that prevents one application from bringing unrelated applications
down (best case) or the whole system down (worst case)), then it's very
expensive.
The way things are stored is that there is a list of processes.
For each of these processes, there is a per process open file table.
For each per process open file table entry, there's a pointer to a
fileglob. Multiple open file instances in the same process and in
other processes in the system (either child, parent, or processes
which have used UNIX domain sockets to pass an open descriptor around)
can point to the same fileglob.
For a fileglob, there's an fg_data pointer; what this represents
depends on the type of entry (e.g. vnode, pipe, socket, etc.).
Multiple fileglobs can point to the same thing (in this case, the
things you care about are vnodes).
The vnode points to the v_data, which represents the in-core data
necessary to access a vnode object - usually, a filesystem object;
there is *usually* a 1:1 correspondence between a vnode and a per-FS
instance object; in the case of HFS, this would be a cnode structure,
or for UFS, an inode structure.
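To make that concrete, the chain looks roughly like this (a stripped-down
sketch, not the actual xnu layout - the real structures have many more
fields, and the names drift between releases):

/* proc -> filedesc -> fileproc -> fileglob -> fg_data -> vnode -> v_data */

struct fileglob {
    int      fg_type;           /* DTYPE_VNODE, DTYPE_SOCKET, DTYPE_PIPE... */
    void    *fg_data;           /* vnode, socket, pipe... per fg_type       */
};

struct fileproc {               /* one per open file table entry            */
    struct fileglob *f_fglob;   /* shared by dup()'d and passed descriptors */
};

struct filedesc {               /* the per-process open file table          */
    struct fileproc **fd_ofiles;
    int               fd_nfiles;
};

struct proc {
    struct filedesc *p_fd;      /* from here you can only walk "down";
                                 * nothing stores "who points at me"        */
};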
So basically, you have two places in the chain you want to back-track
where the OS doesn't store a list of "who all points to me?".
You're correct that this *could* be maintained by the OS, but you're
*wrong* when you say "I hardly think that support could be blamed for
any significant slowdown".
The problem is that in order to maintain these lists, you would need
to allocate list element structures in both cases (this is doable -
it's just a memory penalty); however, when you went to insert and
remove elements from these lists, you'd have to enforce a
serialization barrier on list insertion, deletion, lookup,
uniquification, etc., etc. Pre-protected-mode Mac OS would handle
this by duplication of data and/or HLock()/HUnlock() - serializing
either data validity or access.
This isn't really expensive on a non-preemptive multitasking system
where you don't have protection domain crossing or address space
switching, and for which you only have a single CPU that you're
stalling out until the operation completes. But when you start
talking about modern machines and modern OS's that are inherently more
safe vs. viruses, etc., then you are starting to add up to some real
performance penalties.
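Purely to illustrate where the cost comes from, here is a hypothetical
sketch of what every open would have to do if each vnode kept a "who
points at me" list (nothing like this exists in xnu; the names are made
up for the example):

#include <sys/queue.h>
#include <kern/locks.h>
#include <kern/kalloc.h>

struct proc;

struct vnode_backref {                     /* one list element per opener  */
    LIST_ENTRY(vnode_backref) vb_link;
    struct proc              *vb_proc;
};
LIST_HEAD(backref_list, vnode_backref);

static lck_mtx_t *backref_lock;            /* assume lck_mtx_alloc_init()'d
                                            * at boot; a global choke point */

static void
note_open(struct backref_list *list, struct proc *p)
{
    /* the memory penalty */
    struct vnode_backref *vb = kalloc(sizeof(*vb));

    vb->vb_proc = p;

    /* the serialization barrier: every open/dup/close on every CPU
     * stalls here just to keep the back-references consistent */
    lck_mtx_lock(backref_lock);
    LIST_INSERT_HEAD(list, vb, vb_link);
    lck_mtx_unlock(backref_lock);
}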
So far though, I can't think of anything that involves tracking
the process to which a thread belongs in what I've outlined
above. You aren't making an assumption about v_owner, are you?
Looking at the sources, v_owner is either null or it's
current_thread(), but I guess that can change as a file is
accessed, so that won't be reliable at all.
Modulo bugs in the code, it is completely reliable. It's just not
what you thought it was.
I mean, it's not reliable as an indicator of which process is
keeping a volume busy, my end goal.
If this is your goal, you need to reconsider the information you think
you need in order to accomplish it.
In this case, you have a volume that's being held open by (presumably)
a small number of vnode references, and you want to know where they
are coming from. You can do this the slow way, which is to walk
everything, or you can do this the fast way.
The primary reason lsof is slow in this case is that it displays all
the information, and in obtaining this information, it pushes all the
data, as individual data items, across the user/kernel protection
domain boundary. So you are basically spending all your time in TLB
shootdown, flushing, address space crossing, and copying data that is
not interesting for the problem you are trying to solve.
So don't do that - pretty simple.
What you likely want to do is make yourself a KEXT that needs to be
recompiled vs. each instance of the kernel, so that if anything
changes, the promiscuous knowledge you are using to walk the data
structures (i.e. "the data structure has such-and-such a layout", "the
pointer to such and such a list starts here", "this list is a STAILQ
vs. a TAILQ or an SLIST", etc., etc. -- all the implementation details
we won't promise won't change between software updates) is valid for the
kernel you load the thing into.
Then you ask your KEXT to send you only the information you want: you
iterate the vnode list off the mount point using the locking-of-the-
day that happens to be implemented in a particular version of the OS,
and get together a list of vp's keeping the volume open.
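A rough sketch of that walk (the list head and entry names below come
from xnu's private sys/mount_internal.h and sys/vnode_internal.h, and
are exactly the sort of promiscuous knowledge that can change under
you, so check them - and the locking - against the kernel source you
compile against):

#include <sys/queue.h>
/* plus xnu's private mount_internal.h / vnode_internal.h for the
 * real struct mount and struct vnode layouts */

static int
collect_busy_vnodes(struct mount *mp, vnode_t *out, int max)
{
    vnode_t vp;
    int n = 0;

    /* take the mount's vnode-list lock here, per the xnu-of-the-day */
    TAILQ_FOREACH(vp, &mp->mnt_vnodelist, v_mntvnodes) {
        if (n < max)
            out[n++] = vp;          /* a vp keeping the volume open */
    }
    /* drop the lock */

    return n;
}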
The next part, you are going to want to take in two passes; the first
is mandatory, the second, optional:
First pass, you iterate the proclist, and iterate the open file table
for each proc, and look in the fg_data pointer for the vp's you are
interested in. You copy this out all at once.
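Sketched out (again, allproc, p_fd, fd_ofiles, f_fglob and friends are
xnu internals from proc_internal.h, filedesc.h and file_internal.h, not
stable KPI, and the proc-list and fd locking is omitted for brevity):

static void
report_holders(vnode_t *vps, int nvps)
{
    struct proc *p;
    int i, j;

    LIST_FOREACH(p, &allproc, p_list) {             /* every process       */
        struct filedesc *fdp = p->p_fd;
        if (fdp == NULL)
            continue;
        for (i = 0; i < fdp->fd_nfiles; i++) {      /* its open file table */
            struct fileproc *fp = fdp->fd_ofiles[i];
            if (fp == NULL || fp->f_fglob == NULL ||
                fp->f_fglob->fg_type != DTYPE_VNODE)
                continue;
            for (j = 0; j < nvps; j++)              /* only the vp's you   */
                if ((vnode_t)fp->f_fglob->fg_data == vps[j]) /* care about */
                    printf("pid %d fd %d holds the volume\n", p->p_pid, i);
        }
    }
}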
Second pass (if the first pass turns nothing up, or you are being
pedantic, even though you know you can't close the volume with the
vnodes held by the first pass), you go through the vnode object
mappings for all the processes, and look to see if one of the vm
objects associated with the process is associated with the vnode
pager, and if so, if the backing object for the pager happens to be
one of the vnodes you are interested in.
-
The benefits to doing it this way should be obvious from the above,
but I'll spell them out:
(1) The information from lsof is a snapshot; going at it this way, you
can get real data; by the time you display it, it'll also be a
snapshot, but it will be a *consistent* snapshot
(2) The information doesn't have to have multiple boundary transitions
to get one piece of data (I think lsof is up in the neighborhood of 7
boundary crossings per vnode-backed file to display anything, not
including the display code)
(3) You only push useful data, instead of pushing everything
(4) You are order N*M/2, rather than order N * M * K, where K is
large, because you iterate only for the vnodes you care about, instead
of all of them.
(5) You avoid all that copying of data, TLB mapping, shootdown, buffer
flushing, etc.
The down side is that you have to recompile your KEXT each time a
software update happens, but if you are tracking a bug in FS code you
wrote yourself, this should be no big deal.
All I want to do is add some very useful functionality that existed
in the pre-Mac OS X Mac: the ability to tell who's got a volume open
when you try to unmount it via the Finder. I can't actually fix the
Finder, but I can come close with a contextual menu plugin. The only
mechanism to determine this is lsof, which is slow and unreliable
(it does not detect files opened by the Kernel). I was hoping to
find another way.
Actually, no. If you wanted to do this, you could simply write an
fsevents listener, and remember who did what open, keeping a count of
the opens and closes, removing records in your list when the open count
for a given process goes to 0.
If you wanted to do this via a KEXT, and query it for the information,
that's really easy, too, by using a KAUTH listener for open and close
events.
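A minimal sketch of the KAUTH approach (the per-process bookkeeping
table, error handling, and the user-space query side are left out, and
the start/stop routine names are just placeholders):

#include <mach/mach_types.h>
#include <mach/kmod.h>
#include <sys/kauth.h>
#include <sys/proc.h>
#include <libkern/libkern.h>

static kauth_listener_t g_fileop_listener;

/* Called for KAUTH_SCOPE_FILEOP events; for open and close, arg0 is the
 * vnode and arg1 is the path.  This is where you'd bump or drop the
 * per-pid open count.
 */
static int
fileop_callback(kauth_cred_t cred, void *idata, kauth_action_t action,
                uintptr_t arg0, uintptr_t arg1, uintptr_t arg2, uintptr_t arg3)
{
    pid_t pid = proc_selfpid();

    if (action == KAUTH_FILEOP_OPEN)
        printf("open  by pid %d: %s\n", pid, (const char *)arg1);
    else if (action == KAUTH_FILEOP_CLOSE)
        printf("close by pid %d: %s\n", pid, (const char *)arg1);

    return KAUTH_RESULT_DEFER;      /* fileop listeners don't veto anything */
}

kern_return_t mylistener_start(kmod_info_t *ki, void *d)
{
    g_fileop_listener = kauth_listen_scope(KAUTH_SCOPE_FILEOP,
                                           fileop_callback, NULL);
    return (g_fileop_listener != NULL) ? KERN_SUCCESS : KERN_FAILURE;
}

kern_return_t mylistener_stop(kmod_info_t *ki, void *d)
{
    if (g_fileop_listener != NULL)
        kauth_unlisten_scope(g_fileop_listener);
    return KERN_SUCCESS;
}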
I'd be willing to experiment with adding support to the Kernel & BSD
to keep track of these things as files are opened and closed (as
I said above, I don't agree with your assertion that this needs to
have any significant impact on performance).
If you want to do this for your own use, that's fine, but the
performance penalties on a heavily loaded system could be immense (see
above).
-- Terry