Re: Writing to file as UTF8 with BOM ?
  • Subject: Re: Writing to file as UTF8 with BOM ?
  • From: Christopher Nebel <email@hidden>
  • Date: Thu, 26 Oct 2006 12:54:56 -0700

On Oct 26, 2006, at 8:17 AM, Yvon Thoraval wrote:

Mark J. Reed wrote:

UTF-8, on the other hand, is, as the name implies, an 8-bit encoding.
It's defined in terms of bytes, not 16-bit words, so the order of
those bytes is fixed. You don't need a BOM to distinguish between
some hypothetical *UTF-8LE and *UTF-8BE encodings. But it still makes
sense to put a BOM in a UTF-8 file to identify that file as not only
Unicode text, but specifically as UTF-8 text. In fact, the UTF-8
version of the BOM, since it's 3 bytes instead of 2, is 256 times less
likely than the UTF-16 BOM to appear randomly in data. It's therefore
even closer to a guarantee that the file has UTF-8 text instead of
something else.
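
The byte counts and the "256 times" figure above can be checked with a short sketch. It assumes the data is uniformly random bytes, which real files are not, so treat the ratio as a rough illustration; the helper name `chance_at_start` is made up for this example.

```python
# U+FEFF is the code point used as the byte order mark.
BOM = "\ufeff"

bom_utf8 = BOM.encode("utf-8")          # EF BB BF: three bytes, one fixed order
bom_utf16_be = BOM.encode("utf-16-be")  # FE FF: two bytes, big-endian
bom_utf16_le = BOM.encode("utf-16-le")  # FF FE: two bytes, little-endian


def chance_at_start(signature: bytes) -> float:
    """Chance that uniformly random bytes begin with `signature` (an assumption)."""
    return 1 / 256 ** len(signature)


# One extra byte in the signature makes an accidental match 256 times rarer.
ratio = chance_at_start(bom_utf16_be) / chance_at_start(bom_utf8)

print(bom_utf8.hex(" "), bom_utf16_be.hex(" "), bom_utf16_le.hex(" "))
print(ratio)
```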

I thought UTF-8 could be guessed (successfully) from the content of the file, couldn't it?

The key word there is "guessed". Without a BOM, you can't tell whether or not the file is UTF-8 without examining the entire contents, which is inefficient and may even be impossible in some situations. (Strictly speaking, a leading BOM isn't proof either, since it could just be there as random data, but as Mr. Reed points out, it's pretty unlikely. That was an excellent explanation, Mark; thank you.)
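
To illustrate the difference between checking a signature and guessing from content, here is a minimal sketch. The function names are hypothetical; the point is only that the BOM check looks at three bytes, while validation may have to read to the last byte before it can answer.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Cheap check: inspects only the first three bytes."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """Content-based guess: has to decode every byte before answering."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A mostly-ASCII Latin-1 file whose only non-ASCII byte is the very last one.
# Nothing before that byte distinguishes it from UTF-8.
latin1_data = b"a" * 1000 + "é".encode("latin-1")
utf8_data = b"a" * 1000 + "é".encode("utf-8")

print(is_valid_utf8(latin1_data))   # decided only at the final byte
print(is_valid_utf8(utf8_data))
print(starts_with_utf8_bom(codecs.BOM_UTF8 + utf8_data))
```

Note that a successful decode is still only a guess: pure ASCII data, for example, is valid UTF-8 and valid in most legacy encodings at the same time.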


To sum up: yes, UTF-8 BOMs aren't commonly used, because one of the main uses of UTF-8 is as a format that will work (at least to some degree) with completely Unicode-ignorant applications like, say, grep(1). However, a BOM is still useful to Unicode-aware protocols because it can serve as a signature that the following data is UTF-8, as opposed to some sort of legacy encoding. It's up to the protocol definition whether or not it wants to insist on having a BOM, and there is nothing necessarily wrong with any of the possible choices. For further reading, I recommend <http://www.unicode.org/faq/utf_bom.html>.
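
As an illustration of the trade-off, here is a sketch in Python (not AppleScript, and not taken from this thread) that writes a file with a UTF-8 BOM and reads it back both ways. It relies on Python's standard `utf-8-sig` codec; the file name and sample text are made up for the example.

```python
import codecs
import os
import tempfile

text = "naïve café"
path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")  # hypothetical file name

# "utf-8-sig" writes the three-byte signature before the encoded text.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(text)

with open(path, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(codecs.BOM_UTF8)

# A reader that ignores the signature sees U+FEFF as the first character,
# much as a Unicode-ignorant tool would see three extra leading bytes.
decoded_plain = raw.decode("utf-8")

# A signature-aware reader strips the BOM and returns the original text.
decoded_sig = raw.decode("utf-8-sig")

print(has_bom, repr(decoded_plain[0]), decoded_sig == text)
```

Whether the writer should emit the signature depends on what will consume the file, which is the protocol-level choice described above.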


--Chris Nebel AppleScript Engineering

AppleScript-Users mailing list