How can I access mnt_devblocksize from user space?
- Subject: How can I access mnt_devblocksize from user space?
- From: Sam Vaughan <email@hidden>
- Date: Tue, 23 Sep 2008 16:19:41 +1000
When getting direct I/O running as fast as possible, it's important
to align the file offsets of every request to avoid the kernel having
to call cluster_copy_upl_data to uiomove everything. The performance
penalty of that is very high and should be easily avoidable.
To experiment, I wrote a very simple C program that opens a
file, sets F_NOCACHE on it and starts issuing 1MB preads from an
offset passed in on the command line, stopping when it hits EOF.
The destination buffer for all the reads is always page aligned.
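For anyone who wants to reproduce this, here is a minimal sketch of the read loop described above. It is not the original test tool: the function name read_uncached and the READ_SIZE constant are made up for illustration, and the F_NOCACHE call is guarded so the code still compiles on systems that lack it.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define READ_SIZE (1024 * 1024) /* 1MB per request, as described above */

/* Hypothetical helper: read `path` from `offset` to EOF with uncached
 * preads into a page-aligned buffer.
 * Returns the total number of bytes read, or -1 on error. */
static long long read_uncached(const char *path, off_t offset)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

#ifdef F_NOCACHE
    /* Mac OS X: ask the kernel not to cache data for this descriptor. */
    if (fcntl(fd, F_NOCACHE, 1) == -1) {
        close(fd);
        return -1;
    }
#endif

    /* The destination buffer is always page aligned. */
    void *buf = NULL;
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || posix_memalign(&buf, (size_t)page, READ_SIZE) != 0) {
        close(fd);
        return -1;
    }

    long long total = 0;
    for (;;) {
        ssize_t n = pread(fd, buf, READ_SIZE, offset);
        if (n < 0) {
            total = -1; /* read error */
            break;
        }
        if (n == 0)
            break; /* EOF */
        offset += n;
        total += n;
    }

    free(buf);
    close(fd);
    return total;
}
```

Calling read_uncached(path, 0) and then read_uncached(path, 1) under 'time' should be enough to compare the aligned and unaligned cases.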
Running the test tool with the 'time' Bash built-in or monitoring it
with Shark or dtrace quickly shows the problem. If the initial
offset is zero, the reads are fast and the kernel CPU usage is very
low. If the initial offset is something nasty, cluster_copy_upl_data
gets involved, kernel CPU usage shoots up and the reads are slow.
For a long time I'd simply assumed that as long as the memory was
page aligned and the disk offset was aligned to a 512-byte sector, no
copies would ever be needed. Then about a year ago I was working on
code to read 2K uncompressed video and I discovered that on many
RAIDs, the file offsets need to be aligned to 4k to avoid
the copies occurring.
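The alignment arithmetic itself is simple. A small sketch, assuming the required alignment is a power of two (true of both 512 and 4096); the helper names are hypothetical:

```c
#include <stdint.h>

/* Round `offset` down to a multiple of `align`.
 * Assumes `align` is a non-zero power of two. */
static inline uint64_t align_down(uint64_t offset, uint64_t align)
{
    return offset & ~(align - 1);
}

/* Round `offset` up to a multiple of `align` (same assumption). */
static inline uint64_t align_up(uint64_t offset, uint64_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Non-zero if `offset` is already a multiple of `align`. */
static inline int is_aligned(uint64_t offset, uint64_t align)
{
    return (offset & (align - 1)) == 0;
}
```

A reader would start each request at align_down(offset, alignment) and discard the leading bytes it didn't want. The open question is where the alignment value should come from.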
What I'd like to know is whether this alignment requirement for any
given volume is easily accessible from user space, because I'd like
to set it dynamically.
Empirical testing using my little C program shows that my build
machine's local disk only requires 512-byte alignment to avoid the
copies, but my laptop, my home machine's software RAID and my test
machine's hardware RAID all require 4k alignment.
I've been using a dtrace script to detect calls to
cluster_copy_upl_data because the backtraces in Shark (and indeed the
output from a call to stack() in dtrace) seem so untrustworthy. They
both claim that cluster_read_ext calls cluster_pageout for instance!
Anyway, after reading some of the vfs_cluster code I added a line to my
dtrace script to save off vp->v_mount->mnt_devblocksize when
cluster_read_ext is entered. Sure enough, it contains the correct
magic value wherever I run my test. (dtrace really is awesome :o)
Looking in stat, statfs and getattrlist, I haven't been able to find
a field that exposes this value to user space. Browsing through xnu
in cscope, the getvolattrlist function looks promising, but it turns
out that it will only return mnt_devblocksize if the user asked for
f_bsize and the file system doesn't support that attribute.
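For comparison, this is roughly how the block sizes that are visible from user space can be collected. It is only a sketch: it uses statvfs, the POSIX sibling of statfs, to stay portable, and the struct and function names are invented for illustration. None of these values is guaranteed to match mnt_devblocksize, which is the whole problem.

```c
#include <sys/stat.h>
#include <sys/statvfs.h>

/* Hypothetical container for the sizes user space can query. */
struct fs_block_info {
    unsigned long stat_blksize; /* st_blksize from stat() */
    unsigned long vfs_bsize;    /* f_bsize from statvfs() */
    unsigned long vfs_frsize;   /* f_frsize from statvfs() */
};

/* Fill `out` for the file or volume at `path`.
 * Returns 0 on success, -1 if either call fails. */
static int get_fs_block_info(const char *path, struct fs_block_info *out)
{
    struct stat st;
    struct statvfs vfs;

    if (stat(path, &st) != 0)
        return -1;
    if (statvfs(path, &vfs) != 0)
        return -1;

    out->stat_blksize = (unsigned long)st.st_blksize;
    out->vfs_bsize = (unsigned long)vfs.f_bsize;
    out->vfs_frsize = (unsigned long)vfs.f_frsize;
    return 0;
}
```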
I'm wondering if I've missed something obvious in the above APIs, or
whether there's a better way to get at the mnt_devblocksize field of
a mount_t structure from user space. Has anyone tried to do this
before, or is the general idea to simply go with 4k alignment and
leave it at that?
Thanks in advance for any ideas,
Sam
Filesystem-dev mailing list (email@hidden)