Re: Problem with fcntl/F_LOG2PHYS (was: Re: How to read files from disk directly?)
- Subject: Re: Problem with fcntl/F_LOG2PHYS (was: Re: How to read files from disk directly?)
- From: Mark Day <email@hidden>
- Date: Thu, 7 Jul 2011 10:24:17 -0700
On Jul 6, 2011, at 5:52 PM, Stan Sieler wrote:
> Hi,
>
> Someone mentioned using fcntl ( , F_LOG2PHYS) ...
> out of curiosity, I experimented with it.
>
> I noticed that it returns bad information for a very few files,
> with l2p_devoffset having a value of 0xfffffffffffff000 ...
> which would be one heck of a disk drive!
That looks suspiciously like the block number it got back from the file system was -1. A block number of -1 from VNOP_BLOCKMAP means a "hole" -- that portion of the file is sparse. HFS and HFS Plus don't implement sparse files. But they do have two features which might trigger this behavior.
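To see where that odd value could come from, here is a minimal sketch of the arithmetic. It assumes the device offset is computed as block number times block size, and it assumes a 4096-byte block size (chosen only because it reproduces the value reported above). The helper names are hypothetical, not part of any kernel or library API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte offset on the device, computed as
 * block number * block size + offset within the block. */
int64_t device_offset(int64_t block_number, int64_t block_size,
                      int64_t offset_in_block)
{
    assert(block_size > 0);
    return block_number * block_size + offset_in_block;
}

/* Hypothetical check: a negative offset is what a block number of -1
 * (a "hole") turns into, so treat it as "no on-disk location yet". */
int looks_like_hole(int64_t devoffset)
{
    return devoffset < 0;
}
```

With a block number of -1 and a 4096-byte block, device_offset() returns -4096, which is 0xfffffffffffff000 when viewed as an unsigned 64-bit number.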
The first is delayed allocation. If there is sufficient free space, HFS won't bother immediately trying to figure out where on disk to store the file. When it finally needs to write out the file's data, it decides where to allocate it, and then writes it. This is helpful for short-lived files that may come and go before they ever get written to disk. It also helps in determining the exact size of the file (since it is likely to be at its final size by the time we write its contents to disk), which helps us find a better place to store the file (and not have to preallocate extra space in case the file grows immediately).
The second is delayed zero filling. If you grow the size of a file without writing to that new space, then we need to zero fill it for security purposes. But if you're just going to write to it a little later, writing all of those zeroes is a big waste. So we keep track of newly allocated, but not yet written, ranges within the file. Basically, we temporarily pretend that the file is sparse. And in fact, we return block number -1 from VNOP_BLOCKMAP in this case so that other parts of the OS will automatically return zeroes to user space. (Note: those zeroes *do* get written eventually.)
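From user space, the visible effect is simply that a file grown without writing reads back as zeroes. A minimal sketch of that check, assuming a writable /tmp and using only POSIX calls (the function name is hypothetical, and this shows what a reader observes, not how the file system tracks the unwritten ranges):

```c
#define _DEFAULT_SOURCE   /* expose POSIX declarations on glibc; ignored elsewhere */
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Grow a new temporary file to new_size bytes without writing to it,
 * then read it back. Returns 1 if every byte is zero, 0 if a nonzero
 * byte is seen, -1 on an I/O error. */
int grown_file_reads_zero(size_t new_size)
{
    char path[] = "/tmp/growdemoXXXXXX";   /* assumed scratch location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                          /* removed once fd is closed */

    int result = 1;
    if (ftruncate(fd, (off_t)new_size) != 0) {
        result = -1;
    } else {
        unsigned char buf[4096];
        size_t remaining = new_size;
        while (remaining > 0 && result == 1) {
            size_t want = remaining < sizeof buf ? remaining : sizeof buf;
            ssize_t got = read(fd, buf, want);
            if (got <= 0) {                /* error or unexpected EOF */
                result = -1;
                break;
            }
            assert((size_t)got <= want);
            for (ssize_t i = 0; i < got; i++) {
                if (buf[i] != 0) {
                    result = 0;
                    break;
                }
            }
            remaining -= (size_t)got;
        }
    }
    close(fd);
    return result;
}
```

Calling grown_file_reads_zero(1 << 20) grows a scratch file to 1 MiB and checks that nothing but zeroes comes back, whether or not any blocks have been written to disk yet.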
> Has anyone successfully used fcntl/F_LOG2PHYS?
> Or, can anyone shed light on why I should be getting this kind of result?
>
> One of the failing files is a VMWare Fusion ".vmem" file ...
> perhaps the file is sparse and the first page of the file isn't allocated disk storage.
> (Thus, an lseek (fid, 0, SEEK_SET) would work, but an fcntl/F_LOG2PHYS would not have
> valid data to return.)
>
> But... the problem also occurs for each of the six Bitcoin "db" files (e.g., __db.001 through __db.006):
Hmm, databases and a virtual memory kind of file. Those sound like they might trigger delayed zero filling.
As a workaround, you might try calling fsync() on the file descriptor before you call fcntl(... F_LOG2PHYS...). I think that will force the zero filling to happen immediately, and you should then get the real on-disk location. Note that doing so could cause a performance problem.
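A minimal sketch of that workaround follows. The function name is hypothetical; treating a negative l2p_devoffset as "still a hole" and reporting it as ENXIO are assumptions made for illustration. F_LOG2PHYS reports the location for the descriptor's current file offset, so the sketch seeks first. On systems without F_LOG2PHYS the function just fails with ENOTSUP.

```c
#define _DEFAULT_SOURCE   /* expose POSIX declarations on glibc; ignored elsewhere */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Flush the file, then ask where the byte at file_offset lives on the
 * device. Returns 0 and stores the device offset on success; returns
 * -1 with errno set on failure. */
int physical_offset_after_fsync(int fd, off_t file_offset, off_t *dev_offset)
{
    assert(dev_offset != NULL);

    /* Force pending data (including delayed zero fill) out to disk. */
    if (fsync(fd) != 0)
        return -1;

    /* F_LOG2PHYS maps the current file position, so move there first. */
    if (lseek(fd, file_offset, SEEK_SET) == (off_t)-1)
        return -1;

#ifdef F_LOG2PHYS
    struct log2phys l2p;
    if (fcntl(fd, F_LOG2PHYS, &l2p) == -1)
        return -1;
    if (l2p.l2p_devoffset < 0) {
        errno = ENXIO;      /* assumption: negative means block -1, a hole */
        return -1;
    }
    *dev_offset = l2p.l2p_devoffset;
    return 0;
#else
    errno = ENOTSUP;        /* F_LOG2PHYS is not available on this system */
    return -1;
#endif
}
```

A caller would open the file, call physical_offset_after_fsync(fd, 0, &off), and check errno on failure. Since fsync() can be expensive, it is probably best reserved for files that actually return the bogus offset.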
[...]
> I tested reading every byte of the __db.### files (to ensure the pages
> get allocated if they're sparse files) ... the problem remains afterwards,
> which tends to argue against sparse files being the problem.
> (Of course, if the kernel is smart enough not to allocate those pages just for reads,
> then I haven't eliminated the chance that they're sparse :)
Reads from sparse areas of a file do not cause those areas to be allocated. You just get zeroes back. You have to write to the sparse areas to cause them to be allocated. But as I mentioned above, HFS does not support sparse files persistently.
> The kernel might be doing:
>
> case F_LOG2PHYS:
> ...
> vp = (struct vnode *)fp->f_data;
> VOP_LOCK(vp, LK_EXCLUSIVE, p);
> if (VOP_OFFTOBLK(vp, fp->f_offset, &lbn))
> panic("fcntl LOG2PHYS OFFTOBLK");
> if (VOP_BLKTOOFF(vp, lbn, &offset))
> panic("fcntl LOG2PHYS BLKTOOFF1");
> error = VOP_BMAP(vp, lbn, &devvp, &bn, 0);
> VOP_DEVBLOCKSIZE(devvp, &devBlockSize);
> VOP_UNLOCK(vp, 0, p);
> if (!error) {
> l2p_struct.l2p_flags = 0; /* for now */
> l2p_struct.l2p_contigbytes = 0; /* for now */
> l2p_struct.l2p_devoffset = bn * devBlockSize;
> l2p_struct.l2p_devoffset += fp->f_offset - offset;
> error = copyout((caddr_t)&l2p_struct,
> (caddr_t)uap->arg,
> sizeof (l2p_struct));
>
> which might imply that "bn" (from VOP_BMAP) is bad.
Those VOP_xxx names indicate you're using a really old version of the source. We switched to VNOP_xxx routines in Mac OS X 10.4.
> (On my HP 3000, I'd just put a breakpoint in the kernel at the equivalent
> code and run until I hit it, and then look at the kernel variables ...
> don't know if this is possible on Mac OS X.)
Like Ken Hornstein said, kernel (and kernel extension) debugging is done with two machines, and gdb supports the protocol. You would connect to the target machine over Ethernet or FireWire. (I don't know if the new Thunderbolt machines support kernel debugging over Thunderbolt.)
-Mark
Filesystem-dev mailing list (email@hidden)