On Jul 3, 2011, at 12:44 PM, Eric Gorr wrote:
So, the primary question I have is when reading the data from my device directly, how can I determine exactly what data belongs with what file?
You've already gotten a pointer to TN1150 and fcntl(F_LOG2PHYS) as literal answers to your question. But let me further discourage you or anyone else reading the list from this approach.
Never ever write to the device directly while a file system is mounted. The file system code assumes it is in sole control of the contents of that file system's media. If you write behind the file system's back, you are just asking to corrupt file content or metadata.
Can you guarantee that the file system is mounted read-only? By default, writable media is mounted read/write. If an HFS file system is read/write, then file contents can move at any time, and so the file may no longer be where it was when you called fcntl(F_LOG2PHYS).
If the file system is read/write, then some other process may be writing to the file system. In that case, the file contents may be in the buffer cache, and not yet written to disk.
Might the file system contain compressed files? (Mac OS X installs do.) If so, then the extents you get back (if any!) from fcntl(F_LOG2PHYS) point at the compressed data. And in some cases, the compressed file content is stored inside a node in the Attributes B-tree, and fcntl(F_LOG2PHYS) isn't going to return any extent information for you.
Might the file system be something other than HFS, perhaps NTFS? Note that NTFS uses a mechanism that replaces a few bytes of each sector with an ID number, and stores the original bytes elsewhere. So even if you knew where the contents were on disk, there's more information you'd need to reconstruct the file contents.
HFS Plus can have up to 4 billion allocation blocks. An allocation block number takes 4 bytes to store. In the worst case, you'd need 16GiB just to keep track of all the blocks in logical order. And that doesn't count overhead for associating blocks with files.
On Jul 4, 2011, at 5:16 AM, Eric Gorr wrote:
The issue is that there are 1 million + files and the vast majority of them are small (only a few kb). Based on my tests, the overhead of opening and reading the data for every single file is significant and I figured, if it were practical, that it would be nice to be able to read a few hundred of them in one shot and send them off to a worker thread for processing.
Ah, that helps. When asking questions, it helps to give the underlying problem you're trying to solve (rather than asking about a specific solution). That way, people can point you at other kinds of solutions that might be better. Or they may be able to suggest different ways of approaching the overall goal to avoid the underlying problem you're asking about.
The first thing to do is large, aligned reads. If the file is small, consider reading the entire file in a single read() call. If it is bigger, try to read it in pieces that are a multiple of the file system's block size and a multiple of the VM page size. Multiples of 4KB are usually a good choice. Powers of 2 from 128KB to 16MB are pretty good choices (perhaps go a little larger for big RAID arrays). Beware of really large reads; they'll cause memory pressure and VM thrashing that will slow things down by several orders of magnitude. (I've seen a case where a really large read ended up paging out data from the beginning of that read before it finished the read. When the caller went to look at its buffer, that data had to be paged in -- read in from disk a second time.)
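As a minimal sketch of that idea (the chunk size and the helper names read_file_aligned and process_chunk are my own, not anything from the system):

    /*
     * Read a file in large, page-aligned chunks. CHUNK_SIZE is an
     * assumption -- tune it for your media (multiples of 4 KiB;
     * 128 KiB to 16 MiB are reasonable starting points).
     */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK_SIZE (1024 * 1024)   /* 1 MiB: a multiple of 4 KiB */

    static int read_file_aligned(const char *path,
                                 void (*process_chunk)(const void *, size_t))
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        /* A page-aligned buffer keeps the reads aligned for the VM system. */
        void *buf = valloc(CHUNK_SIZE);
        if (buf == NULL) {
            close(fd);
            return -1;
        }

        ssize_t n;
        while ((n = read(fd, buf, CHUNK_SIZE)) > 0)
            process_chunk(buf, (size_t)n);

        free(buf);
        close(fd);
        return (n < 0) ? -1 : 0;
    }

For a file of only a few KB, a single read() of the whole file is simpler and better.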
Will you (or some other code) be reading the same file(s) multiple times in a short time period? If not, use fcntl(F_NOCACHE) to avoid polluting and thrashing the cache. Note that F_NOCACHE is just a hint. If some of the file's content has already been cached, your read will get it out of cache. F_NOCACHE just tries to prevent putting any more of the file's content in the cache. With F_NOCACHE, it is important to do aligned reads; if you don't, the file content will be cached anyway.
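Setting the hint is just one fcntl() call after open; a sketch:

    /*
     * Ask the kernel not to keep this file's data in the buffer cache.
     * F_NOCACHE is only a hint, and aligned reads still matter (above).
     */
    #include <fcntl.h>

    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        (void)fcntl(fd, F_NOCACHE, 1);
        /* ... large, aligned read()s as in the earlier sketch ... */
    }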
If you need random access within the file, and it is not too big or too small, and is on an internal disk, you could memory map the file ("man mmap"). Memory mapping files on external disks is possible, but if the disk gets hot unplugged, faulting in the file's content can fail; that will typically crash your process unless you go to extreme measures.
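A minimal mapping sketch (error handling abbreviated; remember that touching the mapping can fault, and crash, if the underlying volume goes away):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int fd = open(path, O_RDONLY);
    struct stat st;
    if (fd >= 0 && fstat(fd, &st) == 0 && st.st_size > 0) {
        void *base = mmap(NULL, (size_t)st.st_size, PROT_READ,
                          MAP_PRIVATE, fd, 0);
        if (base != MAP_FAILED) {
            /* ... random access into base[0 .. st.st_size - 1] ... */
            munmap(base, (size_t)st.st_size);
        }
        close(fd);
    }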
Are you overlapping I/O and computation? Are you giving the file system and the disks enough outstanding requests to allow them to schedule the disk heads? Consider using async dispatch blocks ("man dispatch") or multiple threads to read from several files at once, which gives the disks a better chance at arranging the I/Os in order (to help with seek delays). Overlapping I/O and computation can help you get better throughput from the drive, and may help reduce the effects of rotational delays.
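Here's a sketch of fanning the per-file work out with GCD; process_file is a hypothetical callback that does the open/read/close plus your computation:

    #include <dispatch/dispatch.h>

    static void read_all(const char **paths, size_t count,
                         void (*process_file)(const char *))
    {
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
        dispatch_group_t group = dispatch_group_create();

        for (size_t i = 0; i < count; i++) {
            const char *path = paths[i];
            /* Several reads outstanding at once lets the disk reorder them. */
            dispatch_group_async(group, q, ^{
                process_file(path);
            });
        }

        /* Wait for every file to finish before returning. */
        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
        dispatch_release(group);
    }

You'd want to cap how many files you hand off at once so you don't flood memory with a million concurrent reads.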
If seeks really are the problem, one thing you could experiment with is fcntl(F_LOG2PHYS) to figure out the order in which to read the files via the normal open/read/close APIs. It might not be worthwhile to get all of the extents for a file; maybe just try getting the first physical block of each file and sorting based on that value. Again, consider overlapping this with the actual reading and processing of the files. My guess is that the complexity won't be worth the performance gain. If you do try this, I'd be curious to hear how it goes for you.
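Getting that first physical block looks roughly like this (a sketch; the helper name is mine, and all the caveats from earlier apply: the result can go stale on a read/write volume, and compressed files may return no extent at all):

    #include <fcntl.h>
    #include <unistd.h>

    static int first_physical_block(const char *path, off_t *out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        /* F_LOG2PHYS maps the current file offset, which is 0 right
           after open, to a device offset. */
        struct log2phys l2p = { 0 };
        int err = fcntl(fd, F_LOG2PHYS, &l2p);
        close(fd);

        if (err < 0)
            return -1;
        *out = l2p.l2p_devoffset;   /* device offset of logical offset 0 */
        return 0;
    }

Sort the paths by that offset, then read them in that order.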
Do you have access to ADC on iTunes (past WWDC videos)? If so, go look at the WWDC 2008 sessions and find session 375 "Using Filesystem APIs Efficiently." In the Mac OS X Developer Library, take a look at File-System Performance Guidelines. If your processing involves passing the files to framework APIs (especially more than one), then use CFURL or NSURL with a file URL, and don't destroy and recreate it each time; try to pass the same URL object to all of the APIs you call, since the URL will cache metadata about the file and reduce calls into the OS.
I hope that helps,
-Mark