Re: Unravelling the mysteries of F_PREALLOCATE (on APFS and HFS+)
- Subject: Re: Unravelling the mysteries of F_PREALLOCATE (on APFS and HFS+)
- From: James Bucanek <email@hidden>
- Date: Thu, 14 Dec 2017 22:19:39 -0700
Dominic Giampaolo <mailto:email@hidden>
December 14, 2017 at 9:17 AM
Which brings me to my first (and biggest) question: F_PEOFPOSMODE allocates from the
"physical" end of file. What is the physical end of file?
It is the current size of the file as reported by stat() in the st_size field.
This is different than the amount of data allocated to the file which is
returned in st_blocks. The difference between st_blocks * fs-block-size and
st_size is how much the file has pre-allocated. (well you should round up
st_size to a multiple of the fs block size before doing the subtraction).
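The subtraction described above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, st_blocks is taken to be in 512-byte units (as in the tables later in this message), and the filesystem block size is passed in by the caller rather than looked up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes allocated to a file beyond its logical size.
 * Assumes st_blocks counts 512-byte units; fs_block_size is whatever
 * allocation block size the caller believes the volume uses. */
int64_t preallocated_slack(int64_t st_size, int64_t st_blocks,
                           int64_t fs_block_size)
{
    assert(fs_block_size > 0);
    int64_t allocated = st_blocks * 512;
    /* Round st_size up to a whole filesystem block before subtracting. */
    int64_t rounded = ((st_size + fs_block_size - 1) / fs_block_size)
                      * fs_block_size;
    return allocated > rounded ? allocated - rounded : 0;
}
```

For example, with st_size = 20480, st_blocks = 2048 and an assumed 4096-byte block size, the helper reports 1028096 bytes of preallocated space.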
Thanks a million. That piece of information has helped me immensely. I
have now (a) figured out what's going wrong with F_PREALLOCATE, (b)
discovered that APFS and HFS+ treat F_PREALLOCATE differently, and (c)
believe I've figured out a workaround.
Let's say I have a 1MB file and request a 2MB preallocation. Afterwards, is the
"physical" eof 1MB or 3MB? If I perform another 2MB preallocation will the preallocated
space remain at 2MB or will it grow to 4MB? If the latter, how does one determine the
"physical" end of file?
As noted above, the physical end of file is simply the size reported for the
file.
But this is the crux of your issue: you're asking to grow the file but the amount
you want to grow it by doesn't exceed the current amount already pre-allocated for
the file and so apfs returns EINVAL (error 22). Put another way, if (st_size +
amount_you_asked_to_allocate)< (st_blocks * fs_block_size) then apfs returns
EINVAL.
Agreed, and I think this is the root of my problem.
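The condition quoted above can be written as a one-line predicate. This is a sketch of the rule as stated in this thread, not of the actual APFS code; the function name is made up, and st_blocks is assumed to be in 512-byte units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate for the rule described in this thread:
 * a request is "short" (and APFS is said to return EINVAL) when
 * st_size + request is less than the space already allocated. */
bool request_is_short(int64_t st_size, int64_t st_blocks, int64_t request)
{
    assert(request >= 0);
    return st_size + request < st_blocks * 512;
}
```

With st_size = 20480 and st_blocks = 2048, a 1MB request is not short (1069056 is not less than 1048576), so it would be expected to succeed.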
#1 The APFS F_PREALLOCATE bug
Back to my bug report. I can now reproduce the bug in several different
ways, and I also believe I understand (at least indirectly) why it's
failing.
I modified my test code to perform an fstat() on the file before, and
again after, each preallocation request. This let me compare the
preallocation request (fst_length -> fst_bytesalloc) against the actual
change in the file's allocated size (st_blocks).
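A minimal sketch of that measurement, for readers who want to reproduce it. The struct and function names are hypothetical, and the fst_flags value is an assumption (the flags actually used in the test code are not stated here). The F_PREALLOCATE call is guarded so the file still compiles on systems that lack it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical record of one preallocation attempt. */
struct prealloc_sample {
    int64_t size_before, blocks_before;  /* st_size, st_blocks before */
    int64_t size_after, blocks_after;    /* st_size, st_blocks after  */
    int64_t reported;                    /* fst_bytesalloc            */
    int64_t blocks_delta;                /* change in st_blocks * 512 */
};

/* fstat(), request `length` bytes from the end of file, fstat() again.
 * Returns the fcntl() result, or -1 if fstat() fails or F_PREALLOCATE
 * is not available on this platform (errno = ENOTSUP). */
int preallocate_and_measure(int fd, int64_t length,
                            struct prealloc_sample *out)
{
    struct stat st;
    int rc, saved;

    assert(out != NULL);
    if (fstat(fd, &st) != 0)
        return -1;
    out->size_before = st.st_size;
    out->blocks_before = st.st_blocks;
    out->reported = 0;

#ifdef F_PREALLOCATE
    fstore_t store = {0};
    store.fst_flags = F_ALLOCATEALL;     /* assumption, see lead-in */
    store.fst_posmode = F_PEOFPOSMODE;   /* allocate from "physical" EOF */
    store.fst_offset = 0;
    store.fst_length = length;
    rc = fcntl(fd, F_PREALLOCATE, &store);
    saved = errno;
    out->reported = store.fst_bytesalloc;
#else
    (void)length;
    rc = -1;
    saved = ENOTSUP;
#endif

    if (fstat(fd, &st) != 0)
        return -1;
    out->size_after = st.st_size;
    out->blocks_after = st.st_blocks;
    out->blocks_delta = (out->blocks_after - out->blocks_before) * 512;
    errno = saved;
    return rc;
}
```

Comparing `reported` with `blocks_delta` after each call gives the "alloc" and "blocks-delta" columns used in the tables in this message.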
My first success at failure was by requesting a 1MB pre-allocation on an
empty file, writing a small (20K) block of data, then requesting a 1MB
pre-allocation again, repeating until it failed. Here's what happened:
       <------- before F_PREALLOC request -------> <- request/result -> <-------------- after request -------------->
    #:       leof    st_size st_blocks(      *512) fst_length     alloc    st_size st_blocks(      *512) blocks-delta
    1:          0          0         0(         0)    1048576   1048576          0      2048(   1048576)      1048576
    2:      20480      20480      2048(   1048576)    1048576     20480      20480      2088(   1069056)        20480
    3:      40960      40960      2088(   1069056)    1048576     20480      40960      2128(   1089536)        20480
    4:      61440      61440      2128(   1089536)    1048576     20480      61440      2168(   1110016)        20480
Before the first preallocation, the file is empty, st_blocks is 0. After
requesting a 1MB preallocation, it changes to 2048 (1MB). This is as
expected and agrees with the amount returned in fst_bytesalloc (the
alloc column).
After writing 20K, another 1MB preallocation is requested. This time
there is 1MB-20K of preallocated space still left, so F_PREALLOCATE
allocates only the difference between what's already preallocated and
the request, which is exactly 20K. The request reports that 20K was
preallocated (fst_bytesalloc) which agrees with the change in st_blocks
(from 2048 to 2088, or 20K).
This goes on for some time without any problems. Then, the surprise...
    #:       leof    st_size st_blocks(      *512) fst_length     alloc    st_size st_blocks(      *512) blocks-delta
 4845:   99205120   99205120    195768( 100233216)    1048576     20480   99205120    195808( 100253696)        20480
 4846:   99225600   99225600    195808( 100253696)    1048576     20480   99225600    195848( 100274176)        20480
 4847:   99246080   99246080    195848( 100274176)    1048576     20480   99246080    228616( 117051392)     16777216
preallocation failed: errno=22
Request #4,846 requests 1MB, allocates 20K, the file allocation grows by
20K, and all is good. Then request #4,847 requests exactly the same, but
this time the preallocation reports that it allocated 20K but actually
allocated 16,777,216 (16MB), based on the change in the st_blocks.
The next request for 1MB fails with errno 22 (EINVAL). As explained
earlier, this is likely because st_size plus the amount requested (1MB)
now falls short of the space already allocated to the file (st_blocks * 512).
Here's my theory: I think APFS is applying its own preemptive
preallocation logic based on how the file is getting written, and at
some point it internally decides to preallocate more space so all of
those little write() calls won't create a horribly fragmented file. But
because F_PREALLOCATE now requires that you don't request less than
what's already been preallocated, the next call fails.
I've found combinations of write and preallocation sizes that run
indefinitely without any problem and others that meet a similar fate.
I'll be uploading my bug report with this information shortly.
#2 HFS+ is different
I've also discovered, using the same code, that HFS+ treats
F_PREALLOCATE differently. Specifically, it always adds to the file's
allocation, by whatever is requested. Two successive 1MB preallocation
requests will allocate 1MB of space on APFS, but 2MB on HFS+. Which also
explains why my code never failed on HFS+; it never matters how much
space has already been preallocated, F_PREALLOCATE will just add to it.
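The difference can be illustrated with two toy models. These are simplifications based only on the behaviour observed in this message: they ignore block rounding and any extra preallocation APFS performs on its own, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the observed APFS behaviour: the allocation grows only
 * as far as needed to cover st_size + request. */
int64_t apfs_model_alloc(int64_t allocated, int64_t st_size, int64_t request)
{
    assert(request >= 0);
    int64_t target = st_size + request;
    return target > allocated ? target : allocated;
}

/* Toy model of the observed HFS+ behaviour: every request is added on
 * top of whatever is already allocated. */
int64_t hfs_model_alloc(int64_t allocated, int64_t request)
{
    assert(request >= 0);
    return allocated + request;
}
```

Applying each model twice to an empty file with a 1MB request gives 1048576 bytes for the APFS model and 2097152 bytes for the HFS+ model, matching the two results described above.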
#3 My workaround
My workaround, which seems to work under limited testing, is to not
trust fst_bytesalloc to report how much space is/was actually allocated.
By using fstat(), I can determine how much space is/was actually
allocated and adjust my logic so I don't make a "short" allocation
request. This leads to some strange allocation results (including
requests that allocate nothing), but no errors so far:
    #:       leof    st_size st_blocks(      *512) fst_length     alloc    st_size st_blocks(      *512) blocks-delta
 5850:  119787520  119787520    235968( 120815616)    1048576     20480  119787520    236008( 120836096)        20480
 5851:  119808000  119808000    236008( 120836096)    1048576     20480  119808000    236048( 120856576)        20480
 5852:  119828480  119828480    236048( 120856576)    1048576     20480  119828480    268816( 137633792)     16777216
 5853:  136601600  136601600    268816( 137633792)    1048576     16384  136601600    301584( 154411008)     16777216
 5854:  153374720  153374720    301584( 154411008)    1048576     12288  153374720    334352( 171188224)     16777216
 5855:  170147840  170147840    334352( 171188224)    1048576      8192  170147840    367120( 187965440)     16777216
 5856:  186920960  186920960    367120( 187965440)    1048576      4096  186920960    399888( 204742656)     16777216
 5857:  203694080  203694080    399888( 204742656)    1048576         0  203694080    399888( 204742656)            0
 5858:  203714560  203714560    399888( 204742656)    1048576     20480  203714560    432656( 221519872)     16777216
 5859:  220487680  220487680    432656( 221519872)    1048576     16384  220487680    465424( 238297088)     16777216
 5860:  237260800  237260800    465424( 238297088)    1048576     12288  237260800    498192( 255074304)     16777216
 5861:  254033920  254033920    498192( 255074304)    1048576      8192  254033920    530960( 271851520)     16777216
 5862:  270807040  270807040    530960( 271851520)    1048576      4096  270807040    563728( 288628736)     16777216
 5863:  287580160  287580160    563728( 288628736)    1048576         0  287580160    563728( 288628736)            0
Bonus: this logic should work for both APFS and HFS+.
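One possible reading of the workaround, as a sketch only: consult fstat() before each request and skip the request whenever it would be "short". The function names are hypothetical, st_blocks is assumed to be in 512-byte units, and the real logic behind the table above may differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Decide what to ask for, given the file's current size and allocation.
 * Returns `desired` when the request is safe, or 0 to skip it because
 * the existing allocation already extends past st_size + desired. */
int64_t plan_request(int64_t st_size, int64_t st_blocks, int64_t desired)
{
    assert(desired > 0);
    int64_t allocated = st_blocks * 512;
    if (st_size + desired < allocated)
        return 0;        /* would be a "short" request: do not issue it */
    return desired;
}

/* Same decision, driven by fstat() on an open file.
 * Returns -1 if fstat() fails. */
int64_t next_request_length(int fd, int64_t desired)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return plan_request((int64_t)st.st_size, (int64_t)st.st_blocks, desired);
}
```

A caller would pass a non-zero result as fst_length to F_PREALLOCATE and simply carry on writing when the result is 0. Since HFS+ is reported above to add every request to the allocation, skipping some requests there should be harmless, though that has not been verified here.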