Re: Writing to file as UTF8 with BOM ?
- Subject: Re: Writing to file as UTF8 with BOM ?
- From: Richard Rönnbäck <email@hidden>
- Date: Fri, 27 Oct 2006 11:18:30 +0200
- Thread-topic: Writing to file as UTF8 with BOM ?
Thank you Mark.
That is an excellent explanation!
// Richard
> From: "Mark J. Reed" <email@hidden>
> Date: Thu, 26 Oct 2006 10:38:35 -0400
> To: Yvon Thoraval <email@hidden>
> Cc: AS Users <email@hidden>
> Subject: Re: Writing to file as UTF8 with BOM ?
>
> On 10/26/06, Yvon Thoraval <email@hidden> wrote:
>> then the BOM for UTF-8 isn't the same as the UTF-16 one, for files having
>> the same endianness ???
>
> OK, let's back up a bit and make sure we're all on the same page.
>
> What we're calling the BOM here is really just another Unicode
> character, U+FEFF, defined in the Unicode Standard as ZERO WIDTH
> NO-BREAK SPACE. I'll call it ZWNBSP for short. The idea behind the
> ZWNBSP is that if you put one between two characters, it will prevent
> a Unicode-aware hyphenation engine from breaking the word at that
> point.
>
> The Byte Order Mark, or BOM, is another name for that character, which
> grew out of a convention used by Unicode-aware applications. It's
> just that, a convention, albeit one with some intentional help from
> the Standard folks. It's based on these two facts:
>
> 1. Given the semantics of a ZWNBSP, it makes no sense to put one at
> the very beginning of a file.
> 2. U+FFFE is not a legal Unicode character.
>
> Therefore, if you put a ZWNBSP at the beginning of a UTF-16-encoded
> file, that provides a handy way to tell whether it's UTF-16LE or
> UTF-16BE without changing the meaning of the text. If the first two
> bytes are (254, 255) in that order, then you know you have
> UTF-16BE - because if it were UTF-16LE, then (254, 255) would
> represent U+FFFE, which is illegal. By the same logic, if you find
> (255, 254) in that order, then you have UTF-16LE.
>
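The two-byte check described above can be sketched in Python roughly as follows. This is only an illustration: the function name `utf16_byte_order` is hypothetical, and it assumes the data is already known to be UTF-16 of some kind.

```python
def utf16_byte_order(data: bytes):
    """Guess the byte order of UTF-16 data from a leading BOM (hypothetical helper)."""
    head = data[:2]
    if head == bytes([254, 255]):   # FE FF: U+FEFF stored big-endian
        return "UTF-16BE"
    if head == bytes([255, 254]):   # FF FE: U+FEFF stored little-endian
        return "UTF-16LE"
    return None                     # no BOM; the byte order must be known some other way


# U+FEFF followed by "hi", serialized in each byte order
big = "\ufeffhi".encode("utf-16-be")
little = "\ufeffhi".encode("utf-16-le")

print(utf16_byte_order(big))
print(utf16_byte_order(little))
```

Returning `None` when no BOM is present leaves the decision to the caller, since the BOM is only a convention and may be absent.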
> So that's the reason for the existence of the BOM. But it's not the
> only use of it.
>
> In Unicode as with everything else, it's difficult to move forward and
> make improvements in support for new standards while simultaneously
> maintaining backward compatibility. Application vendors really needed
> a way to indicate whether a file was Unicode or something else, and
> realized that the BOM could serve that role, too. After all, if a
> file starts with what looks like a BOM, there's a good chance it's
> Unicode text and not something else.
>
> The upshot is that a BOM does two things simultaneously:
>
> (1) flag a file as Unicode
> (2) identify the specific transformation format in use - including
> endianness, if applicable.
>
> UTF-16 is a 16-bit encoding. It tells you what 16-bit values go in
> the file. It has nothing to say about how those 16-bit values are
> divided into bytes. Hence, UTF-16BE and UTF-16LE.
>
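To make the difference between 16-bit values and bytes concrete, here is a small Python sketch. The sample string is an arbitrary choice for illustration; both of its characters fit in a single 16-bit value.

```python
text = "A\u20ac"  # U+0041 and U+20AC, each one 16-bit value in UTF-16

big = text.encode("utf-16-be")     # high byte of each value first
little = text.encode("utf-16-le")  # low byte of each value first

print(big.hex(" "))     # 00 41 20 ac
print(little.hex(" "))  # 41 00 ac 20
```

The 16-bit values are identical in both cases; only the order of the two bytes within each value changes.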
> UTF-8, on the other hand, is, as the name implies, an 8-bit encoding.
> It's defined in terms of bytes, not 16-bit words, so the order of
> those bytes is fixed. You don't need a BOM to distinguish between
> some hypothetical *UTF-8LE and *UTF-8BE encodings. But it still makes
> sense to put a BOM in a UTF-8 file to identify that file as not only
> Unicode text, but specifically as UTF-8 text. In fact, the UTF-8
> version of the BOM, since it's 3 bytes instead of 2, is 256 times less
> likely than the UTF-16 BOM to appear randomly in data. It's therefore
> even closer to a guarantee that the file has UTF-8 text instead of
> something else.
>
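For the original question of writing a file as UTF-8 with a BOM, the idea can be sketched in Python (not AppleScript) as below. It relies on Python's `utf-8-sig` codec, which writes the three-byte BOM when encoding and skips it when decoding; the file name and the sample text are made up for the example.

```python
import codecs
import os
import tempfile

# U+FEFF encoded as UTF-8 is always the same three bytes: EF BB BF
bom = "\ufeff".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom_example.txt")  # hypothetical file name

    # "utf-8-sig" prepends the BOM on write
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write("caf\u00e9")

    # Raw bytes: BOM followed by the UTF-8 encoding of the text
    with open(path, "rb") as f:
        raw = f.read()

    # "utf-8-sig" also strips a leading BOM on read
    with open(path, encoding="utf-8-sig") as f:
        decoded = f.read()

print(raw[:3] == codecs.BOM_UTF8)
print(decoded)
```

Reading the same file with plain `utf-8` would instead leave U+FEFF as the first character of the text.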
> Unicode-aware apps should be prepared to deal with BOMs in any Unicode
> Transformation Format, which means they may need to examine up to four
> bytes:
>
> UTF-32BE: 00 00 FE FF
> UTF-32LE: FF FE 00 00
> UTF-16BE: FE FF xx xx
> UTF-16LE: FF FE xx xx
> UTF-8: EF BB BF xx
>
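A BOM sniffer along the lines of the table above might look like this in Python. It is a minimal sketch, not a full encoding detector: `detect_bom` and `BOMS` are hypothetical names, and the byte constants come from the standard `codecs` module.

```python
import codecs

# The UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE),
# so the four-byte signatures have to be tested before the two-byte ones.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    head = data[:4]  # four bytes are enough for every signature
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbfhello"))
print(detect_bom(b"\xff\xfeh\x00i\x00"))
print(detect_bom(b"plain ascii"))
```

One caveat: UTF-16LE text that begins with a BOM followed by U+0000 has the same first four bytes as the UTF-32LE BOM, so this ordering would report it as UTF-32LE. That case is rare in text files, but the check remains a heuristic.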
> --
> Mark J. Reed <email@hidden>
> _______________________________________________
> Do not post admin requests to the list. They will be ignored.
> AppleScript-Users mailing list (email@hidden)
> Help/Unsubscribe/Update your Subscription:
> edband.net
> Archives: http://lists.apple.com/mailman//archives/applescript-users
>
> This email sent to email@hidden