Re: FSRef speed impacts and permission checks
- Subject: Re: FSRef speed impacts and permission checks
- From: Mark Day <email@hidden>
- Date: Thu, 15 May 2008 13:58:09 -0700
On May 15, 2008, at 11:02 AM, Thomas Tempelmann wrote:
>> In a recursive walk, it will read individual B-tree nodes (typically
>> 8KB) as needed; they're often not sequential, so they may require
>> the drive to seek to a different track.
>
> Well, if it boils down to that, then I'm still not satisfied.
Good! Skepticism is healthy, especially when it comes to explaining
performance.
> Fact is that in both cases _all_ data of the entire dir is retrieved
> exactly once (unless there are extra internal read accesses to check
> parent permissions, but then these should also be happening for both
> cases in the same amount).
The sizes and order of the I/Os will be different. That is likely to
cause a performance difference.
If you'd like to try a performance experiment, create a large
(multi-gigabyte) file, larger than twice your RAM. Then read every
other 32KB chunk from start to end (0-32K, 64-96K, 128-160K, and so
on). Now do the same with every other 8KB chunk. (To get more
repeatable numbers, you may wish to reboot before each test, turn off
networking, avoid launching other apps, etc.) I'll bet you find a
non-trivial difference between the two (32KB at a time being faster).
If you were to try a similar experiment where you read the entire
first half of the file (not skipping every other chunk), you'd find
the performance difference was less. And both 8KB and 32KB times
should be faster than the every other chunk test, even though you're
reading the same amount of data. The buffer cache notices forward
sequential access and will read ahead (typically 1MB or larger for
modern internal disks) so that most of your read calls will find all
of the requested data already in cache.
There is a per-I/O overhead for every I/O issued to the disk. Reading
32KB takes nowhere near four times as long as reading 8KB. The extra
24KB is almost free.
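If it helps, the experiment above can be sketched in a few lines of
Python (a rough sketch, not a tuned benchmark; the file name is a
placeholder, and a real run needs the large file and cold cache
described above):

```python
import os
import time

def read_every_other_chunk(path, chunk_size):
    """Read every other chunk_size-byte chunk of the file, returning
    (bytes_read, elapsed_seconds)."""
    size = os.path.getsize(path)
    total = 0
    start = time.monotonic()
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while offset < size:
            total += len(os.pread(fd, chunk_size, offset))  # one chunk
            offset += 2 * chunk_size                        # skip the next
    finally:
        os.close(fd)
    return total, time.monotonic() - start

# Usage (on a real test file larger than twice RAM, cold cache):
#   read_every_other_chunk("bigfile", 32 * 1024)  # 32KB variant
#   read_every_other_chunk("bigfile", 8 * 1024)   # 8KB variant reads the
#   same bytes overall but issues 4x as many I/Os, so the per-I/O
#   overhead should make it measurably slower.
```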
> When I call GetCatalogInfoBulk, that function has the freedom to
> return the items in the order it prefers.
CatalogSearch has even more freedom. It can mix entries from one
directory in with entries from many other directories, in whatever
order it finds them.
> And since on HFS nodes are ordered by their name, seeking out the
> contents should still be rather sequential. I mean, the nodes are
> directly indexed by the node ids, so if I provide a large enough
> buffer to GetCatalogInfoBulk, it should be able to read all items of
> that dir in one go, and pretty fast. Should not be much different to
> what CatSearch runs into.
Well, the keys of the items in the directory are *logically*
contiguous, but the nodes are not necessarily *physically* contiguous
(depending on how the directory was populated, and previous usage
patterns). The leaf records for a single directory might be in nodes
27, 198, and 36, in that order. GetCatalogInfoBulk will visit the
nodes in key order, not node number order. CatalogSearch visits nodes
in node number order, which is usually disk block order.
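As a toy illustration of why visit order matters (using the
hypothetical node numbers 27, 198, 36 from above, not real catalog
data; each unit of distance stands in for head movement):

```python
def total_seek_distance(node_order):
    """Sum of absolute jumps between consecutively visited nodes."""
    return sum(abs(b - a) for a, b in zip(node_order, node_order[1:]))

key_order = [27, 198, 36]        # GetCatalogInfoBulk: key (name) order
disk_order = sorted(key_order)   # CatalogSearch: node number order

print(total_seek_distance(key_order))   # 171 + 162 = 333
print(total_seek_distance(disk_order))  # 9 + 162 = 171
```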
Plus getting to the first item in the directory requires a search of
the B-tree, which means reading one index node at each level of the
tree until you get to the leaf node containing the first item in the
directory. Some of those index nodes will already be in the buffer
cache. But some of them probably won't, which means extra reads, and
possibly seeks.
> So, what really needs to be figured out is how many seeks the
> recursive approach needs and if those extra seek times are all what
> makes it so much slower.
By the way, is your recursive search depth first or breadth first?
Every time you recurse, you're jumping to a different part of the key
space (because you're jumping to a different parent directory ID, and
that's the primary field that the B-tree is sorted by). A depth first
search will do a lot more jumping back and forth through the key
sequence.
If it's a depth first search, in what order do you visit
subdirectories of previously visited directories? In theory, you
should keep a list of to-be-visited directories, sorted by their
directory ID. Every time you encounter a subdirectory, add it to the
sorted list. When you hit the end of the current directory, remove
the smallest directory ID from the to-be-visited list and walk that
directory next. That should reduce the jumping back and forth through
the key sequence. Many years ago, I modified a recursive search to
use this technique, and it was significantly faster than a depth first
search. But CatSearch still beat it.
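The technique above can be sketched with a min-heap of pending
directories. This is a hypothetical sketch, not the code I used back
then; it uses st_ino from os.scandir as a stand-in for the directory
ID (on HFS+ the inode number is the catalog node ID):

```python
import heapq
import os

def walk_by_dir_id(root):
    """Walk a tree, always descending next into the pending directory
    with the smallest directory ID, instead of plain depth-first.
    Yields (id, path) for every entry visited."""
    to_visit = [(os.stat(root).st_ino, root)]  # min-heap keyed on dir ID
    while to_visit:
        dir_id, path = heapq.heappop(to_visit)
        try:
            entries = list(os.scandir(path))
        except OSError:
            continue  # unreadable directory; skip it
        for entry in entries:
            yield entry.inode(), entry.path
            if entry.is_dir(follow_symlinks=False):
                # Queue the subdirectory in ID order rather than
                # recursing into it immediately.
                heapq.heappush(to_visit, (entry.inode(), entry.path))
```

The heap keeps the to-be-visited list sorted by directory ID with
O(log n) inserts, so each "jump" tends to move forward through the key
sequence instead of bouncing back and forth.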
> I could test this by writing my own code scanning the catalog, so
> that I can be sure nothing else is going on. Or is there a
> kernel-level logging facility that can tell me which sectors are
> read from disk so that I can generate a profile from that?
Glad you asked!
There is the fs_usage tool. You're interested in reading of metadata,
which is printed as "RdMeta". You can filter the output through grep
to show just the metadata reads to a specific partition. So if the
volume is mounted on disk0s2, you might do something like:
% sudo fs_usage -f filesys -w | grep --line-buffered RdMeta | grep --line-buffered disk0s2
Run your test tool. Switch back to this Terminal window and press
Ctrl-C to stop it.
The output would look something like the following (where I've removed
a lot of the spaces to make it fit better in email; the real output is
about 200 columns wide):
12:50:57.600 RdMeta D=0x00033c08 B=0x2000 /dev/disk0s2 0.017132 W Finder
12:50:57.606 RdMeta D=0x00033a48 B=0x2000 /dev/disk0s2 0.004796 W Finder
12:50:57.608 RdMeta D=0x00033bc8 B=0x2000 /dev/disk0s2 0.002029 W Finder
12:50:57.609 RdMeta D=0x00033b68 B=0x2000 /dev/disk0s2 0.000168 W Finder
The first column is the time of the I/O (I believe the completion
time). The "RdMeta" indicates the kind of operation, in this case
reading file system metadata from the disk driver. The number after
"D=" is the starting disk block number. For most hard disks, a disk
block is 512 bytes. The number after "B=" is the number of bytes.
The "/dev/disk0s2" is which disk (generally, which path). Then comes
the duration of the I/O in seconds. The "W" indicates that the
operation blocked (waited); this is more interesting when looking at
system calls, to see which ones completed without blocking; disk I/O
always blocks. Lastly is the name of the process (in this case, I was
browsing in the Finder). You could of course filter based on the name
of your process, too.
If you run an experiment comparing CatalogSearch to a recursive
search, you might want to sum up all of the byte counts. And see if
any of the I/Os are repeated (same "D=..." column). CatalogSearch
will read but ignore unused nodes. A recursive search may end up
reading some of the index nodes multiple times, depending on how good
your locality of reference is, and how many buffers the kernel has.
It might be enlightening to look at the distribution of durations
(notice the first one in the list above is more than 100 times slower
than the last one, yet they all read the same number of bytes).
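For the summing and repeat-detection, a small script can digest the
fs_usage output. A sketch: the regex assumes the RdMeta line format
shown above, and the sample lines are adapted from it, with one block
number repeated deliberately to show detection of a re-read:

```python
import re
from collections import Counter

# Matches the fields of interest in an fs_usage RdMeta line: the
# starting disk block (D=...) and the byte count (B=...).
LINE_RE = re.compile(r"RdMeta\s+D=(0x[0-9a-f]+)\s+B=(0x[0-9a-f]+)", re.I)

def summarize(lines):
    """Return (total_bytes, Counter of starting-block occurrences)."""
    total = 0
    blocks = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        blocks[m.group(1)] += 1
        total += int(m.group(2), 16)
    return total, blocks

sample = [
    "12:50:57.600 RdMeta D=0x00033c08 B=0x2000 /dev/disk0s2 0.017132 W Finder",
    "12:50:57.606 RdMeta D=0x00033a48 B=0x2000 /dev/disk0s2 0.004796 W Finder",
    "12:50:57.608 RdMeta D=0x00033c08 B=0x2000 /dev/disk0s2 0.002029 W Finder",
]
total, blocks = summarize(sample)
print(total)                                    # 24576 (3 x 8KB nodes)
print([d for d, n in blocks.items() if n > 1])  # ['0x00033c08']
```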
In 10.5.0 and later, there is also dtrace. Attached is a shell script
named "bufio". It uses dtrace to provide output similar to the
above. It also shows the entry and exit of sync(2) system calls. You
might run it like:
% sudo ./bufio -disk disk0s2
Press Ctrl-C to stop it.
One of the really cool things about dtrace is that it has built in
support for simple statistics and associative arrays. If you pass the
-summary option to this script, at the end it will print out a list of
all of the I/Os and the number of times each I/O (matching type, disk,
block, byte count) occurred. It would be relatively easy to get it to
do a simple graph of the distribution of the durations.
NOTE: On a multi-core system, the output from dtrace may not be in
completely sorted order (by time stamp). Each core gets its own
logging buffer. The tool tends to get samples from one core, then
more samples from another core, and so on. If you're trying to
examine things in chronological order, you need to sort the output
based on a time stamp; for short samples, making sure the time stamp
prints in a fixed number of columns, then sorting textually (e.g.,
with the sort(1) tool), is good enough.
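The textual sort looks like this in Python (equivalent to piping
through sort(1)); it works only because the fixed-width timestamp
sorts the same way textually as chronologically:

```python
# Out-of-order records, as dtrace might emit them on a multi-core box.
lines = [
    "12:50:57.608 RdMeta D=0x00033bc8 B=0x2000 /dev/disk0s2",
    "12:50:57.600 RdMeta D=0x00033c08 B=0x2000 /dev/disk0s2",
    "12:50:57.606 RdMeta D=0x00033a48 B=0x2000 /dev/disk0s2",
]
# A plain lexicographic sort on the whole line restores time order,
# since every line begins with a fixed-width HH:MM:SS.mmm timestamp.
lines.sort()
print(lines[0][:12])  # 12:50:57.600
```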
> Plus, I need to verify my suspicion that a root process does not
> require the file system to check the permissions on the parent dirs:
> If I run the same test from a non-root user, it must slow down
> pretty much to the speed of the recursive search because it should
> require a similar amount of random directory reads.
>
> Agreed?
I suspect a recursive search run as root will still be slower than
CatalogSearch run as non-root, due to CatalogSearch's more efficient
I/O pattern. I'd be interested to see the result.
-Mark
Attachment:
bufio
Description: Binary data