Re: Reading Middle Eastern Characters

  • Subject: Re: Reading Middle Eastern Characters
  • From: Christopher Nebel <email@hidden>
  • Date: Mon, 6 Dec 2004 23:11:29 -0800

On Dec 6, 2004, at 10:30 PM, Ferenc Farkas MÁTYÁS wrote:

and European users), not UTF-8. If you say "read ... as <<class utf8>>", you'll get the right result, but see the next bit. (The << and >> are chevrons, which won't go through the mailing list correctly. Ritual cursing of the list here.)

How can one determine whether the text is in UTF-8, MacRoman, or some other encoding? The reason I am asking is that if the text is UTF-8, it is read correctly, but if it's not, I get an empty output from read. I can test whether it's empty or not, but it's an ugly workaround, I think.

Believe it or not, that's about the best you can do, and is actually what's recommended in many cases. Because of how UTF-8 is structured, it's unlikely you'll have data that looks like UTF-8 but isn't, so the usual technique is to try to interpret the data as UTF-8 first; if that fails, then fall back to some other encoding, usually the system-primary one. Doing this by reading the file twice isn't particularly efficient, but given AppleScript's facilities for this sort of thing, you're a bit stuck for it.
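The try-UTF-8-first technique is language-neutral, so here is a minimal sketch of it in Python rather than AppleScript. The function name `decode_with_fallback` and the choice of MacRoman as the fallback are assumptions made for illustration; substitute whatever your system-primary encoding is.

```python
from typing import Tuple


def decode_with_fallback(data: bytes, fallback: str = "mac_roman") -> Tuple[str, str]:
    """Return (text, encoding_used), trying UTF-8 before the fallback.

    Hypothetical helper: the name and the MacRoman default are assumptions,
    not part of any AppleScript or system API.
    """
    try:
        # Strict decoding: malformed UTF-8 raises instead of being guessed at.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the assumed legacy encoding.
        # errors="replace" keeps this branch from raising on unmapped bytes.
        return data.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    name = "MÁTYÁS"
    print(decode_with_fallback(name.encode("utf-8")))
    print(decode_with_fallback(name.encode("mac_roman")))
```

Unlike the read-twice approach, this reads the bytes once and only decodes twice, which is about as cheap as the technique gets.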


Correctly determining which of the dozens of conceivable text encodings a given hunk of data uses is essentially an AI-complete problem -- that is, you need human-level intelligence, and even most humans would have a hard time with some of the fringe cases.

If you're in a position to dictate the encodings you'll handle, then by all means do so. Many folks will automatically handle UTF-16, since a UTF-16 data file will always start with 0xfeff (or 0xfffe for byte-swapped UTF-16), and punt on everything else.
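The "handle UTF-16, punt on everything else" policy can be sketched like this, again in Python for illustration. The function names are hypothetical, and the sketch assumes the file really does begin with a byte-order mark; UTF-16 data written without one would be rejected here.

```python
import codecs
from typing import Optional


def sniff_utf16_bom(data: bytes) -> Optional[str]:
    """Return the UTF-16 byte order implied by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE (byte-swapped)
        return "utf-16-le"
    return None


def decode_utf16_or_punt(data: bytes) -> str:
    """Decode BOM-marked UTF-16; refuse anything else (hypothetical policy)."""
    encoding = sniff_utf16_bom(data)
    if encoding is None:
        raise ValueError("no UTF-16 byte-order mark; encoding not handled")
    # Skip the 2-byte BOM so it does not show up as U+FEFF in the text.
    return data[2:].decode(encoding)


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_BE + "hello".encode("utf-16-be")
    print(decode_utf16_or_punt(sample))
```

One caveat: a little-endian UTF-32 file also starts with the bytes FF FE, so a stricter version would check for that case before settling on UTF-16.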


--Chris Nebel
AppleScript Engineering