Re: FSRef speed impacts and permission checks
- Subject: Re: FSRef speed impacts and permission checks
- From: Mark Day <email@hidden>
- Date: Thu, 15 May 2008 13:58:09 -0700
On May 15, 2008, at 11:02 AM, Thomas Tempelmann wrote:
>> In a recursive walk, it will read individual B-tree nodes (typically
>> 8KB) as needed; they're often not sequential, so they may require
>> the drive to seek to a different track.
>
> Well, if it boils down to that, then I'm still not satisfied.
Good! Skepticism is healthy, especially when it comes to explaining
performance.
> Fact is that in both cases _all_ data of the entire dir is retrieved
> exactly once (unless there are extra internal read accesses to check
> parent permissions, but then these should also be happening for both
> cases in the same amount).
The sizes and order of the I/Os will be different. That is likely to
cause a performance difference.
If you'd like to try a performance experiment, create a large
(multi-gigabyte) file, larger than twice your RAM. Then read every
other 32KB chunk from start to end (0-32K, 64-96K, 128-160K, and so
on). Now do the same with every other 8KB chunk. (To get more
repeatable numbers, you may wish to reboot before each test, turn off
networking, avoid launching other apps, etc.) I'll bet you find a
non-trivial difference between the two (32KB at a time being faster).
If you were to try a similar experiment where you read the entire
first half of the file (not skipping every other chunk), you'd find
the performance difference was less. And both 8KB and 32KB times
should be faster than the every other chunk test, even though you're
reading the same amount of data. The buffer cache notices forward
sequential access and will read ahead (typically 1MB or larger for
modern internal disks) so that most of your read calls will find all
of the requested data already in cache.
There is a per-I/O overhead for every I/O issued to the disk. Reading
32KB takes nowhere near four times as long as reading 8KB. The extra
24KB is almost free.
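If it helps, the experiment above can be sketched in a few lines of
Python (a rough sketch, not a tuned benchmark; the file name is a
placeholder, and a real run needs the large file and cold cache
described above):

```python
import os
import time

def read_every_other_chunk(path, chunk_size):
    """Read every other chunk_size-byte chunk of the file, returning
    (bytes_read, elapsed_seconds)."""
    size = os.path.getsize(path)
    total = 0
    start = time.monotonic()
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while offset < size:
            total += len(os.pread(fd, chunk_size, offset))  # one chunk
            offset += 2 * chunk_size                        # skip the next
    finally:
        os.close(fd)
    return total, time.monotonic() - start

# Usage (on a real test file larger than twice RAM, cold cache):
#   read_every_other_chunk("bigfile", 32 * 1024)  # 32KB variant
#   read_every_other_chunk("bigfile", 8 * 1024)   # 8KB variant reads the
#   same bytes overall but issues 4x as many I/Os, so the per-I/O
#   overhead should make it measurably slower.
```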
> When I call GetCatalogInfoBulk, that function has the freedom to
> return the items in the order it prefers.
CatalogSearch has even more freedom. It can mix entries from one
directory in with entries from many other directories, in whatever
order it finds them.
> And since on HFS nodes are ordered by their name, seeking out the
> contents should still be rather sequential. I mean, the nodes are
> directly indexed by the node ids, so if I provide a large enough
> buffer to GetCatalogInfoBulk, it should be able to read all items of
> that dir in one go, and pretty fast. Should not be much different to
> what CatSearch runs into.
Well, the keys of the items in the directory are *logically*
contiguous, but the nodes are not necessarily *physically* contiguous
(depending on how the directory was populated, and previous usage
patterns). The leaf records for a single directory might be in nodes
27, 198, and 36, in that order. GetCatalogInfoBulk will visit the
nodes in key order, not node number order. CatalogSearch visits nodes
in node number order, which is usually disk block order.
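As a toy illustration of why visit order matters (using the
hypothetical node numbers 27, 198, 36 from above, not real catalog
data; each unit of distance stands in for head movement):

```python
def total_seek_distance(node_order):
    """Sum of absolute jumps between consecutively visited nodes."""
    return sum(abs(b - a) for a, b in zip(node_order, node_order[1:]))

key_order = [27, 198, 36]        # GetCatalogInfoBulk: key (name) order
disk_order = sorted(key_order)   # CatalogSearch: node number order

print(total_seek_distance(key_order))   # 171 + 162 = 333
print(total_seek_distance(disk_order))  # 9 + 162 = 171
```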
Plus getting to the first item in the directory requires a search of
the B-tree, which means reading one index node at each level of the
tree until you get to the leaf node containing the first item in the
directory. Some of those index nodes will already be in the buffer
cache. But some of them probably won't, which means extra reads, and
possibly seeks.
> So, what really needs to be figured out is how many seeks the
> recursive approach needs and if those extra seek times are all what
> makes it so much slower.
By the way, is your recursive search depth first or breadth first?
Every time you recurse, you're jumping to a different part of the key
space (because you're jumping to a different parent directory ID, and
that's the primary field that the B-tree is sorted by). A depth first
search will do a lot more jumping back and forth through the key
sequence.
If it's a depth first search, in what order do you visit
subdirectories of previously visited directories? In theory, you
should keep a list of to-be-visited directories, sorted by their
directory ID. Every time you encounter a subdirectory, add it to the
sorted list. When you hit the end of the current directory, remove
the smallest directory ID from the to-be-visited list and walk that
directory next. That should reduce the jumping back and forth through
the key sequence. Many years ago, I modified a recursive search to
use this technique, and it was significantly faster than a depth first
search. But CatSearch still beat it.
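The technique above can be sketched with a min-heap of pending
directories. This is a hypothetical sketch, not the code I used back
then; it uses st_ino from os.scandir as a stand-in for the directory
ID (on HFS+ the inode number is the catalog node ID):

```python
import heapq
import os

def walk_by_dir_id(root):
    """Walk a tree, always descending next into the pending directory
    with the smallest directory ID, instead of plain depth-first.
    Yields (id, path) for every entry visited."""
    to_visit = [(os.stat(root).st_ino, root)]  # min-heap keyed on dir ID
    while to_visit:
        dir_id, path = heapq.heappop(to_visit)
        try:
            entries = list(os.scandir(path))
        except OSError:
            continue  # unreadable directory; skip it
        for entry in entries:
            yield entry.inode(), entry.path
            if entry.is_dir(follow_symlinks=False):
                # Queue the subdirectory in ID order rather than
                # recursing into it immediately.
                heapq.heappush(to_visit, (entry.inode(), entry.path))
```

The heap keeps the to-be-visited list sorted by directory ID with
O(log n) inserts, so each "jump" tends to move forward through the key
sequence instead of bouncing back and forth.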
> I could test this by writing my own code scanning the catalog, so
> that I can be sure nothing else is going on. Or is there a
> kernel-level logging facility that can tell me which sectors are
> read from disk so that I can generate a profile from that?
Glad you asked!
There is the fs_usage tool. You're interested in reading of metadata,
which is printed as "RdMeta". You can filter the output through grep
to show just the metadata reads to a specific partition. So if the
volume is mounted on disk0s2, you might do something like:
% sudo fs_usage -f filesys -w | grep --line-buffered RdMeta | grep --line-buffered disk0s2
Run your test tool. Switch back to this Terminal window and press
Ctrl-C to stop it.
The output would look something like the following (where I've removed
a lot of the spaces to make it fit better in email; the real output is
about 200 columns wide):
12:50:57.600 RdMeta D=0x00033c08 B=0x2000 /dev/disk0s2 0.017132 W Finder
12:50:57.606 RdMeta D=0x00033a48 B=0x2000 /dev/disk0s2 0.004796 W Finder
12:50:57.608 RdMeta D=0x00033bc8 B=0x2000 /dev/disk0s2 0.002029 W Finder
12:50:57.609 RdMeta D=0x00033b68 B=0x2000 /dev/disk0s2 0.000168 W Finder
The first column is the time of the I/O (I believe the completion
time). The "RdMeta" indicates the kind of operation, in this case
reading file system metadata from the disk driver. The number after
"D=" is the starting disk block number. For most hard disks, a disk
block is 512 bytes. The number after "B=" is the number of bytes.
The "/dev/disk0s2" is which disk (generally, which path). Then comes
the duration of the I/O in seconds. The "W" indicates that the
operation blocked (waited); this is more interesting when looking at
system calls, to see which ones completed without blocking; disk I/O
always blocks. Lastly is the name of the process (in this case, I was
browsing in the Finder). You could of course filter based on the name
of your process, too.
If you run an experiment comparing CatalogSearch to a recursive
search, you might want to sum up all of the byte counts. And see if
any of the I/Os are repeated (same "D=..." column). CatalogSearch
will read but ignore unused nodes. A recursive search may end up
reading some of the index nodes multiple times, depending on how good
your locality of reference is, and how many buffers the kernel has.
It might be enlightening to look at the distribution of durations
(notice the first one in the list above is more than 100 times slower
than the last one, yet they all read the same number of bytes).
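For the summing and repeat-detection, a small script can digest the
fs_usage output. A sketch: the regex assumes the RdMeta line format
shown above, and the sample lines are adapted from it, with one block
number repeated deliberately to show detection of a re-read:

```python
import re
from collections import Counter

# Matches the fields of interest in an fs_usage RdMeta line: the
# starting disk block (D=...) and the byte count (B=...).
LINE_RE = re.compile(r"RdMeta\s+D=(0x[0-9a-f]+)\s+B=(0x[0-9a-f]+)", re.I)

def summarize(lines):
    """Return (total_bytes, Counter of starting-block occurrences)."""
    total = 0
    blocks = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        blocks[m.group(1)] += 1
        total += int(m.group(2), 16)
    return total, blocks

sample = [
    "12:50:57.600 RdMeta D=0x00033c08 B=0x2000 /dev/disk0s2 0.017132 W Finder",
    "12:50:57.606 RdMeta D=0x00033a48 B=0x2000 /dev/disk0s2 0.004796 W Finder",
    "12:50:57.608 RdMeta D=0x00033c08 B=0x2000 /dev/disk0s2 0.002029 W Finder",
]
total, blocks = summarize(sample)
print(total)                                    # 24576 (3 x 8KB nodes)
print([d for d, n in blocks.items() if n > 1])  # ['0x00033c08']
```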
In 10.5.0 and later, there is also dtrace. Attached is a shell script
named "bufio". It uses dtrace to provide output similar to the
above. It also shows the entry and exit of sync(2) system calls. You
might run it like:
% sudo ./bufio -disk disk0s2
Press Ctrl-C to stop it.
One of the really cool things about dtrace is that it has built in
support for simple statistics and associative arrays. If you pass the
-summary option to this script, at the end it will print out a list of
all of the I/Os and the number of times each I/O (matching type, disk,
block, byte count) occurred. It would be relatively easy to get it to
do a simple graph of the distribution of the durations.
NOTE: On a multi-core system, the output from dtrace may not be in
completely sorted order (by time stamp). Each core gets its own
logging buffer. The tool tends to get samples from one core, then
more samples from another core, and so on. If you're trying to
examine things in chronological order, you need to sort the output
based on a time stamp; for short samples, making sure the time stamp
prints in a fixed number of columns, then sorting textually (e.g.,
with the sort(1) tool), is good enough.
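The textual sort looks like this in Python (equivalent to piping
through sort(1)); it works only because the fixed-width timestamp
sorts the same way textually as chronologically:

```python
# Out-of-order records, as dtrace might emit them on a multi-core box.
lines = [
    "12:50:57.608 RdMeta D=0x00033bc8 B=0x2000 /dev/disk0s2",
    "12:50:57.600 RdMeta D=0x00033c08 B=0x2000 /dev/disk0s2",
    "12:50:57.606 RdMeta D=0x00033a48 B=0x2000 /dev/disk0s2",
]
# A plain lexicographic sort on the whole line restores time order,
# since every line begins with a fixed-width HH:MM:SS.mmm timestamp.
lines.sort()
print(lines[0][:12])  # 12:50:57.600
```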
> Plus, I need to verify my suspicion that a root process does not
> require the file system to check the permissions on the parent dirs:
> If I run the same test from a non-root user, it must slow down
> pretty much to the speed of the recursive search because it should
> require a similar amount of random directory reads.
>
> Agreed?
I suspect a recursive search run as root will still be slower than
CatalogSearch run as non-root, due to CatalogSearch's more efficient
I/O pattern. I'd be interested to see the result.
-Mark
Attachment:
bufio
Description: Binary data