Re: How can I access mnt_devblocksize from user space?
- Subject: Re: How can I access mnt_devblocksize from user space?
- From: Sam Vaughan <email@hidden>
- Date: Thu, 25 Sep 2008 12:12:35 +1000
On 25/09/2008, at 4:27 AM, Kevin Elliott wrote:
> It's a lot less fragile than you think. Functionally speaking,
> that key name is equivalent to a fcntl selector - changing it would
> break lots of code, and lots of code relies on it being there. It
> isn't going anywhere.
That's good to know. The concern from my perspective is that named
properties in IOKit nodes don't make for as concrete an interface as
structures and enumerations in BSD header files. They're simply too
easy to change. In a past life as a clustered file system weenie I
was burnt several times by changes to the layout of the IOKit
registry. I understand that things are more mature now, but old
wounds take time to heal. :o)
> Understand, the BSD layer sits on top of IOKit, not the other way
> around. Any of the fcntls that provide data on hardware get that
> data using IOKit. Indeed, if you look at the source for
> IOMediaBSDClient you'll find a big switch statement that maps fcntl
> selectors to IOKit keys, then uses those keys to get the requested
> data.
I understand the layering of BSD over IOKit. I just think that in
this case there's a hole in the BSD layer forcing me to reach through
and do some messy IOKit stuff when it could all be so much simpler.
I doubt there are many programmers who wouldn't choose a one line BSD
call over the IOKit alternative. That's why I was asking whether an
enhancement request might be well received.
A field in struct statfs would probably be more appropriate than a
new fcntl since there are so many other similar properties there
already. It's a bit odd really when you think about it. There's not
much point advertising an optimal transfer block size if you don't
also point out that it will only give good results if the file
offsets used are aligned to 4k boundaries.
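
For comparison, here is a minimal sketch of what the BSD layer offers today: statfs reports an optimal transfer size, but says nothing about the device block size that offsets need to be aligned to. The helper name optimal_transfer_size is made up for illustration. The f_iosize field is the one Mac OS X's struct statfs provides; the non-Apple branch (f_bsize from the Linux header) is only an assumption so that the sketch builds elsewhere.

```c
#include <stdio.h>
#include <sys/param.h>
#ifdef __APPLE__
#include <sys/mount.h>   /* statfs() and struct statfs on Mac OS X */
#else
#include <sys/vfs.h>     /* statfs() on Linux (assumed fallback) */
#endif

/*
 * Hypothetical helper: return the optimal transfer size advertised for
 * the volume that holds `path`, or -1 if statfs() fails.
 * Note that this is NOT mnt_devblocksize; that value is not exposed here.
 */
long optimal_transfer_size(const char *path)
{
    struct statfs sfs;

    if (statfs(path, &sfs) != 0)
        return -1;
#ifdef __APPLE__
    return (long)sfs.f_iosize;   /* optimal transfer block size */
#else
    return (long)sfs.f_bsize;    /* closest equivalent on Linux */
#endif
}
```

Calling optimal_transfer_size("/") gives the value for the root volume. A caller still has no way to learn from this interface which offset alignment makes that transfer size pay off.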
> Yes, it's quite possible you'll see greater than 4k. 8k is fairly
> common on the mid to high end, and I think I've heard of 16k on the
> very high end. Basically as the number of drives in a RAID
> increases, it's common to increase the block size to encourage the
> OS to use read sizes that give good performance.
That's interesting. I wonder whether the cluster vfs layer ends up
in cluster_copy_upl_data when mnt_devblocksize is a multiple of 4k.
I'd have thought that once the page size barrier is crossed, pages
could simply be remapped in upls as appropriate. I wonder if Joe
reads filesystem-dev...
If anyone's curious and might have such a RAID, it's easy to find
out. Here's how I've been doing my testing:
- - - - - - - -
Save the two files at the bottom of this email to disk.
Compile the C file:
$ gcc -o devblocksize devblocksize.c
Create a symlink to a large file called "bigfile":
$ ln -s /some/large/file bigfile
$ ls -lLh bigfile
-rw-r--r-- 1 samv staff 807M Sep 19 2005 bigfile
Run the test from starting offset zero:
$ sudo su
# ./devblocksize.d & time ./devblocksize bigfile 0; kill %1
Read 0x3274eb76 (846523254) bytes
real 0m2.031s
user 0m0.001s
sys 0m0.270s
mnt_devblocksize is 0x1000 (4096)
Function                          Calls
cluster_copy_upl_data                 1
cluster_read_ext                    775
pread_nocancel                      775
Function                    Total Bytes
cluster_copy_upl_data              2934
pread_nocancel                812646400
Function                  CPU Time (ns)
cluster_copy_upl_data             15862
cluster_read_ext              221218438
pread_nocancel                225094068
Note that on this volume, mnt_devblocksize is 4k. The starting
offset of zero is aligned, so cluster_copy_upl_data was only called
once, to copy the final 2934 bytes of the file. Run the test again
from starting offset 512:
# ./devblocksize.d & time ./devblocksize bigfile 512; kill %1
Read 0x3274eb76 (846523254) bytes
real 0m3.881s
user 0m0.002s
sys 0m0.795s
mnt_devblocksize is 0x1000 (4096)
Function                          Calls
cluster_read_ext                    779
pread_nocancel                      779
cluster_copy_upl_data               780
Function                    Total Bytes
pread_nocancel                816840704
cluster_copy_upl_data         817162614
Function                  CPU Time (ns)
cluster_copy_upl_data         413313593
cluster_read_ext              721742406
pread_nocancel                726990224
This time cluster_copy_upl_data is called for every read, and accounts
for more than half the total CPU time of the operation.
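
The alignment arithmetic behind these numbers can be sketched as follows. The helper names are hypothetical, and the block size is assumed to be a power of two (4096 on this volume); nothing here queries mnt_devblocksize itself.

```c
#include <sys/types.h>

/* All helpers assume blksz is a power of two, e.g. 4096. */

/* Nonzero if off sits exactly on a blksz boundary. */
int is_aligned(off_t off, off_t blksz)
{
    return (off & (blksz - 1)) == 0;
}

/* Round off down to the previous blksz boundary. */
off_t align_down(off_t off, off_t blksz)
{
    return off & ~(blksz - 1);
}

/* Bytes left over past the last full block of a file of `size` bytes. */
off_t unaligned_tail(off_t size, off_t blksz)
{
    return size & (blksz - 1);
}
```

For the 846523254-byte test file and a 4096-byte block, unaligned_tail gives 2934, which matches the single cluster_copy_upl_data call in the aligned run. A reader that is handed offset 512 could start at align_down(512, 4096), which is 0, and discard the first 512 bytes instead of issuing misaligned reads.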
- - - - - - - - devblocksize.c - - - - - - - -
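#include <fcntl.h>
#include <libgen.h>     /* basename */
#include <stdio.h>
#include <stdlib.h>     /* strtoll, valloc, free */
#include <unistd.h>     /* pread, close */
#include <sys/param.h>
#define IO_SIZE (1 << 20)
#define BAIL_IF(cond, fmt, args...) \
    if (cond) {fprintf(stderr, fmt, ##args); err = 1; goto bail;}
#define PBAIL_IF(cond, func) \
    if (cond) {perror(#func " failed"); err = 1; goto bail;}
int main(int argc, char** argv)
{
    /* Everything the cleanup code touches is initialized before the
       first goto, so bail: never sees an indeterminate value. */
    int err = 0;
    int fd = -1;
    char* buf = NULL;
    off_t off = 0;
    ssize_t bytes;
    BAIL_IF(argc != 3, "usage: %s <file> <start-offset>\n", basename(argv[0]));
    off = strtoll(argv[2], NULL, 10);
    buf = (char*)valloc(IO_SIZE);   /* page-aligned buffer */
    PBAIL_IF(!buf, valloc);
    fd = open(argv[1], O_RDONLY, 0);
    PBAIL_IF(fd < 0, open);
    err = fcntl(fd, F_NOCACHE, 1);  /* turn off data caching for this fd */
    PBAIL_IF(err < 0, fcntl);
    /* Read IO_SIZE bytes at a time until a short read signals end of file. */
    for (bytes = IO_SIZE; bytes == IO_SIZE; off += bytes)
    {
        bytes = pread(fd, buf, IO_SIZE, off);
        PBAIL_IF(bytes < 0, pread);
    }
    /* off is the file offset reached, so it includes the start offset. */
    fprintf(stderr, "Read 0x%llx (%lld) bytes\n", (long long)off, (long long)off);
bail:
    if (fd >= 0)
        close(fd);
    if (buf)
        free(buf);
    return err;
}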
- - - - - - - - devblocksize.d - - - - - - - -
#!/usr/sbin/dtrace -s
#pragma D option quiet
:mach_kernel:pread_nocancel:entry /execname=="devblocksize"/
{
@agg[probefunc] = count();
@sum[probefunc] = sum(((struct pread_nocancel_args*)arg1)->nbyte);
self->start[probefunc] = vtimestamp;
}
::cluster_read_ext:entry /execname=="devblocksize"/
{
blocksize = ((vnode_t)arg0)->v_mount->mnt_devblocksize;
@agg[probefunc] = count();
self->start[probefunc] = vtimestamp;
}
::cluster_copy_upl_data:entry /execname=="devblocksize"/
{
@agg[probefunc] = count();
@sum[probefunc] = sum(*(int*)arg3);
self->start[probefunc] = vtimestamp;
}
:mach_kernel:pread_nocancel:return,
::cluster_read_ext:return,
::cluster_copy_upl_data:return /execname=="devblocksize"/
{
this->time = vtimestamp - self->start[probefunc];
@times[probefunc] = sum(this->time);
}
dtrace:::END
{
printf("\nmnt_devblocksize is 0x%x (%d)\n", blocksize, blocksize);
printf("\n%-25s %13s\n", "Function", "Calls");
printa("%-25s %@13d\n", @agg);
printf("\n%-25s %13s\n", "Function", "Total Bytes");
printa("%-25s %@13d\n", @sum);
printf("\n%-25s %13s\n", "Function", "CPU Time (ns)");
printa("%-25s %@13d\n", @times);
}
- - - - - - - -