Re: bad fsync? (A.M.)
- Subject: Re: bad fsync? (A.M.)
- From: James Berry <email@hidden>
- Date: Mon, 21 Feb 2005 13:40:45 -0800
Hey Kevin,
I know this is a comparison of Apples and Oranges. I guess I was just
looking for a reality check on this seemingly large difference in
performance.
I suppose that boils down to the fact that it really does take a long
time to get the data onto the platter. If the write just writes to the
track cache, it never has to pay the penalty of seek time and
rotational delay, which can (I presume one would assert) be a
significant multiple of the cached write. So, in other words, we've
gotten very used to the speed of the cached write, without really
thinking too hard about the data safety issues.
-jdb
On Feb 21, 2005, at 1:07 PM, Kevin Van Vechten wrote:
It sounds like you're comparing apples and oranges (pardon the pun).
As Dominic pointed out, F_FULLFSYNC offers a level of synchronization
that's not available from fsync on Mac OS X or other platforms. If
you were to only rely on fsync, I imagine the performance between
systems would be comparable.
One way to mitigate this issue for large operations is to wrap your
statements in a single transaction. I believe this allows sqlite to
sync only after the entire operation is complete, instead of between
individual statements.
- Kevin
On Feb 21, 2005, at 9:58 AM, James Berry wrote:
Dominic,
Thanks for a great explanation. A similar thread has come up on the
sqlite list, claiming that using F_FULLFSYNC leads to dramatic
performance decreases on Darwin over other platforms. Any thoughts on
these issues?
From: email@hidden
Subject: [sqlite] sluggish operation on os x?
Date: February 21, 2005 12:44:03 AM PST
To: email@hidden
Reply-To: email@hidden
Finally getting my SQLite3 code working, I'm experiencing awfully
slow performance when writing individual data on OS X.
Linux (Ubuntu) is lightning fast, Win32 is.. tolerable, but OS X
really crawls. I've tried both with built-in sources, and the new
SQLite3 fink module. Same behaviour.
....
From: email@hidden
Subject: Re: [sqlite] sluggish operation on os x?
Date: February 21, 2005 9:40:27 AM PST
To: email@hidden
Reply-To: email@hidden
I noticed this as well, so I profiled my call and found sync was
taking forever. I removed the following fcntl call:
rc = fcntl(fd, F_FULLFSYNC, 0);
Performance was back to normal.
ck
On Feb 19, 2005, at 5:59 PM, Dominic Giampaolo wrote:
MySQL makes the following claim at:
http://dev.mysql.com/doc/mysql/en/news-4-1-9.html
"InnoDB: Use the fcntl() file flush method on Mac OS X versions 10.3
and up. Apple had disabled fsync() in Mac OS X for internal disk
drives, which caused corruption at power outages."
First of all, is this accurate? A pointer to some docs or a tech note
on this would be helpful.
The comments about fsync() are wrong...
On MacOS X, fsync() always has and always will flush all file data
from host memory to the drive on which the file resides. The behavior
of fsync() on MacOS X is the same as it is on every other version of
Unix since the dawn of time (well, since the introduction of fsync
anyway :-).
I believe that what the above comment refers to is the fact that
fsync() is not sufficient to guarantee that your data is on stable
storage and on MacOS X we provide a fcntl(), called F_FULLFSYNC,
to ask the drive to flush all buffered data to stable storage.
Let me explain in more detail. With fsync() even though the OS
writes the data through to the disk and the disk says "yes I wrote
the data", the data is not actually on permanent storage. Unless
you explicitly disable it, all disks have a write buffer which holds
data you've written. The disk buffers the data you wrote until it
decides to flush it to the platters (and the writes may not be in
the order you wrote them). If you lose power or the system crashes
before the data is written, you can wind up in a situation where only
some of your data is actually on disk. What is worse is that even if
you write blocks A, B and C, call fsync() and then write block D you
may find after rebooting that blocks A and D are on disk but B and C
are not (in fact any ordering of A, B, C, and D is possible).
While this may seem like a rare case it is not. In fact if you sit
down and pull the plug on a system you can make it happen in one or
two plug pulls. I have even gone so far as to watch this behavior
with a logic analyzer on the ATA bus: I saw the data for two writes
come across the ATA cable, the drive replied and said the writes were
successful and then when we rebooted the data from the second write
was correct on disk but the data from the first write was not.
To deal with this we introduced the F_FULLFSYNC fcntl which will ask
the drive to flush all of its buffered data to disk. When an app
needs to guarantee that data is on disk it should use F_FULLFSYNC.
In most cases you do not need such a heavy handed operation and
fsync() is good enough. But in an app like a database, it is
essential if you want transactional integrity.
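
To make the call described above concrete, here is a minimal sketch in C. The helper name durable_sync is hypothetical, and falling back to fsync() when F_FULLFSYNC is unavailable or rejected is an assumption of this sketch, not something the thread prescribes:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only: flush fd as durably as
 * the platform allows. Returns 0 on success, -1 on failure. */
int durable_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Where F_FULLFSYNC is defined (Mac OS X), ask the drive to flush
     * its own write buffer to stable storage. */
    if (fcntl(fd, F_FULLFSYNC, 0) == 0)
        return 0;
    /* Assumption: if the request is rejected, fall through to a plain
     * fsync() rather than fail outright. */
#endif
    /* fsync() pushes data from host memory to the drive only. */
    return fsync(fd);
}
```

A database-style caller would use durable_sync(fd) at commit points and plain fsync() elsewhere. Note that the fallback path gives only the weaker fsync() guarantee.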
Now, a little bit more detail: on ATA drives we implement F_FULLFSYNC
with the FLUSH_TRACK_CACHE command. All drives sold by Apple will
honor this command. Unfortunately quite a few firewire drive vendors
disable this command and do not pass it to the drive. This means that
most external firewire drives are not reliable if you lose power or
the system crashes. We can't work around that unless we ask the drive
to disable the write cache completely (which hurts performance quite
badly -- and even that may not be enough as some drives will ignore
that request too).
So in summary, I believe that the comments in the MySQL news posting
are slightly confused. On MacOS X fsync() behaves the same as it does
on all Unices. That's not good enough if you really care about data
integrity and so we also provide the F_FULLFSYNC fcntl. As far as I
know, MacOS X is the only OS to provide this feature for apps that
need to truly guarantee their data is on disk.
Hope this clears things up.
--dominic
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Darwin-dev mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
This email sent to email@hidden