Get the underlying cache block size of a filesystem?
- Subject: Get the underlying cache block size of a filesystem?
- From: James Bucanek <email@hidden>
- Date: Thu, 11 Feb 2010 09:20:28 -0700
Greetings,
Disclaimer: I'm a big proponent of "don't fight the OS" and
"don't reinvent the wheel", but sometimes you just have to. My
application presents a pathological case that is at odds with
the filesystem's natural caching behavior, and I'm trying to
find a more efficient solution.
Scenario: I have a very large index file (GBs) that has millions
of very small records (12 bytes each). Records are read
stochastically while parsing an incoming stream of data. You can
guess what happens: (1) an entire block of data is read from
the disk and cached, (2) 12 bytes are extracted, (3) the block
eventually gets pushed out of the cache by other file activity,
process repeats.
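For concreteness, the access pattern is roughly the sketch below (the record layout and surrounding bookkeeping are simplified, and the names are placeholders):

    /* Simplified sketch of the current access pattern: each lookup pulls
     * 12 bytes straight out of the index file, so the OS reads and caches
     * an entire filesystem block just to deliver one record. */
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define RECORD_SIZE 12

    /* Read record number 'n' directly from the index file. */
    static ssize_t read_record(int fd, uint64_t n, uint8_t out[RECORD_SIZE])
    {
        return pread(fd, out, RECORD_SIZE, (off_t)(n * RECORD_SIZE));
    }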
The result is that almost every record read incurs the overhead
of reading the 4K/8K/32K/whatever block of data from the
physical media. This effectively "overreads" hundreds of
potentially interesting records in the same block, discards
them, and reads them again next time.
So the standard wisdom of "let the OS cache the data" isn't
working here.
I've *significantly* improved performance by creating my own
in-memory cache of records. When I read a record, I calculate the
4K region of the file that the record resides in, read that 4K
span, and then copy all "interesting" records (typically 10-20%
of the block) into a compact cache. The next time a record in that
range is requested, I can satisfy the request from the
cache--until the cache fills up, but that's rare.
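A rough sketch of the idea, assuming a simple flat record layout (cache_lookup, cache_store, and record_is_interesting stand in for my real cache and record logic):

    /* Sketch of the block-caching scheme described above: align the record
     * down to the ~4K run of whole records it falls in, read that run once,
     * and keep the interesting records for later requests. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define RECORD_SIZE  12
    #define SPAN_RECORDS (4096 / RECORD_SIZE)   /* ~4K worth of whole records */

    typedef struct {
        uint64_t offset;                 /* file offset of the record */
        uint8_t  data[RECORD_SIZE];
    } cached_record;

    /* Hypothetical helpers -- the real code uses a compact in-memory
     * structure keyed by record offset. */
    extern int  cache_lookup(uint64_t offset, uint8_t out[RECORD_SIZE]);
    extern void cache_store(const cached_record *rec);
    extern int  record_is_interesting(const uint8_t *rec);

    static int get_record(int fd, uint64_t record_no, uint8_t out[RECORD_SIZE])
    {
        uint64_t file_off = record_no * RECORD_SIZE;

        /* Satisfy the request from the cache if we've already seen this span. */
        if (cache_lookup(file_off, out))
            return 0;

        /* Otherwise read the whole span the record falls in... */
        uint64_t first_rec  = (record_no / SPAN_RECORDS) * SPAN_RECORDS;
        uint64_t span_start = first_rec * RECORD_SIZE;
        uint8_t  span[SPAN_RECORDS * RECORD_SIZE];
        ssize_t  got = pread(fd, span, sizeof span, (off_t)span_start);
        if (got < (ssize_t)(file_off - span_start) + RECORD_SIZE)
            return -1;   /* read error or short read past EOF */

        /* ...and keep every interesting record it contains. */
        for (ssize_t i = 0; i + RECORD_SIZE <= got; i += RECORD_SIZE) {
            if (record_is_interesting(span + i)) {
                cached_record rec;
                rec.offset = span_start + (uint64_t)i;
                memcpy(rec.data, span + i, RECORD_SIZE);
                cache_store(&rec);
            }
        }

        memcpy(out, span + (file_off - span_start), RECORD_SIZE);
        return 0;
    }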
This new technique is almost 100 times faster than the old one,
but I'd like to make it as efficient as possible. I realize that
my app could still be repeatedly reading the same block of data
if the OS always reads 16K or 32K at a time and I only cache 4K
of that. On the other hand, I don't want to arbitrarily increase
this value; reading too much at a time slows down individual
requests and causes the cache to fill too quickly. And I realize
that on a networked file system, the buffering size might be
considerably smaller and reading more just wastes bandwidth.
So I come to beg the gurus of filesystem architecture for
advice. Is there an API that I can use to discover the
actual/typical block read size employed by a filesystem or
filesystem cache? I've looked at the various Carbon functions
and things like fcntl(), but can't find anything (obvious).
Alternatively, is there a constant that I could reasonably
assume to be close to the actual read-block size in most
situations? I'm targeting OS X 10.4-10.6, although optimal
support for 10.4 isn't critical.
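By way of illustration, here's the shape of call I'm hoping for. One candidate I can see is statfs(), whose f_iosize field is documented as the "optimal transfer block size", but I have no idea whether that bears any relation to what the cache actually reads from disk:

    /* Query the volume a file lives on for its reported block sizes.
     * Whether f_iosize reflects the cache's real read size is exactly
     * what I'm unsure about. */
    #include <stdio.h>
    #include <sys/param.h>
    #include <sys/mount.h>

    int main(int argc, char *argv[])
    {
        struct statfs fs;
        const char *path = (argc > 1) ? argv[1] : ".";

        if (statfs(path, &fs) != 0) {
            perror("statfs");
            return 1;
        }
        printf("f_bsize  = %u\n", (unsigned)fs.f_bsize);
        printf("f_iosize = %d\n", (int)fs.f_iosize);
        return 0;
    }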
Thanks!
--
James Bucanek