Re: Writing to file as UTF8 with BOM ?
- Subject: Re: Writing to file as UTF8 with BOM ?
- From: "Mark J. Reed" <email@hidden>
- Date: Thu, 26 Oct 2006 10:38:35 -0400
On 10/26/06, Yvon Thoraval <email@hidden> wrote:
> then the BOM for UTF-8 isn't the same as the UTF-16 one, for files
> having the same endianness???
OK, let's back up a bit and make sure we're all on the same page.
What we're calling the BOM here is really just another Unicode
character, U+FEFF, defined in the Unicode Standard as ZERO WIDTH
NO-BREAK SPACE. I'll call it ZWNBSP for short. The idea behind the
ZWNBSP is that if you put one between two characters, it prevents a
Unicode-aware layout engine from inserting a line break at that
point.
The Byte Order Mark, or BOM, is another name for that character, which
grew out of a convention used by Unicode-aware applications. It's
just that, a convention, albeit one with some intentional help from
the Standard folks. It's based on these two facts:
1. Given the semantics of a ZWNBSP, it makes no sense to put one at
the very beginning of a file.
2. U+FFFE is not a legal Unicode character.
Therefore, if you put a ZWNBSP at the beginning of a UTF-16-encoded
file, that provides a handy way to tell whether it's UTF-16LE or
UTF-16BE without changing the meaning of the text. If the first two
bytes are (254, 255) in that order, then you know you have
UTF-16BE - because if it were UTF-16LE, then (254, 255) would
represent U+FFFE, which is illegal. By the same logic, if you find
(255, 254) in that order, then you have UTF-16LE.
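A quick sketch in Python (my addition, not from the original post) showing that the standard codecs agree with that logic: U+FEFF serializes to different byte pairs depending on byte order.

```python
# U+FEFF (ZWNBSP, a.k.a. the BOM) encodes differently per byte order.
zwnbsp = "\ufeff"
print(zwnbsp.encode("utf-16-be").hex())  # feff
print(zwnbsp.encode("utf-16-le").hex())  # fffe
# So a file beginning with FE FF must be big-endian: read as
# little-endian, those two bytes would be U+FFFE, which is illegal.
```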
So that's the reason for the existence of the BOM. But that's not its
only use.
In Unicode as with everything else, it's difficult to move forward and
make improvements in support for new standards while simultaneously
maintaining backward compatibility. Application vendors really needed
a way to indicate whether a file was Unicode or something else, and
realized that the BOM could serve that role, too. After all, if a
file starts with what looks like a BOM, there's a good chance it's
Unicode text and not something else.
The upshot is that a BOM does two things simultaneously:
(1) flags the file as Unicode text, and
(2) identifies the specific transformation format in use - including
endianness, if applicable.
UTF-16 is a 16-bit encoding. It tells you what 16-bit values go in
the file. It has nothing to say about how those 16-bit values are
divided into bytes. Hence, UTF-16BE and UTF-16LE.
UTF-8, on the other hand, is, as the name implies, an 8-bit encoding.
It's defined in terms of bytes, not 16-bit words, so the order of
those bytes is fixed. You don't need a BOM to distinguish between
some hypothetical *UTF-8LE and *UTF-8BE encodings. But it still makes
sense to put a BOM in a UTF-8 file to identify that file as not only
Unicode text, but specifically as UTF-8 text. In fact, the UTF-8
version of the BOM, since it's 3 bytes instead of 2, is 256 times less
likely than the UTF-16 BOM to appear randomly in data. It's therefore
even closer to a guarantee that the file has UTF-8 text instead of
something else.
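You can verify those three bytes yourself (again, my sketch, not part of the original post): U+FEFF falls in the range U+0800-U+FFFF, which UTF-8 encodes in three bytes.

```python
# The UTF-8 serialization of U+FEFF is the three-byte sequence
# EF BB BF - the "UTF-8 BOM".
print("\ufeff".encode("utf-8").hex())  # efbbbf
```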
Unicode-aware apps should be prepared to deal with BOMs in any Unicode
Transformation Format, which means they may need to examine up to four
bytes:
UTF-32BE: 00 00 FE FF
UTF-32LE: FF FE 00 00
UTF-16BE: FE FF xx xx
UTF-16LE: FF FE xx xx
UTF-8: EF BB BF xx
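The table above can be turned into a small sniffing routine. This is an illustrative sketch of my own (the function name and return strings are arbitrary); note that UTF-32LE has to be tested before UTF-16LE, because FF FE 00 00 starts with the UTF-16LE pattern FF FE.

```python
def sniff_bom(data):
    """Identify a Unicode transformation format from a leading BOM.

    Returns a codec name, or None if no BOM is present.
    Order matters: UTF-32LE (FF FE 00 00) must be checked before
    UTF-16LE (FF FE), which it would otherwise match.
    """
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    return None

print(sniff_bom(b"\xef\xbb\xbfhello"))  # utf-8
```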
--
Mark J. Reed <email@hidden>
_______________________________________________
Do not post admin requests to the list. They will be ignored.
AppleScript-Users mailing list (email@hidden)
Help/Unsubscribe/Update your Subscription:
Archives: http://lists.apple.com/mailman//archives/applescript-users
This email sent to email@hidden