Re: What is Best Practice for Reading Files of Unknown Encoding?

  • Subject: Re: What is Best Practice for Reading Files of Unknown Encoding?
  • From: Shane Stanley <email@hidden>
  • Date: Tue, 21 Jun 2016 18:04:56 +1000

On 21 Jun 2016, at 2:16 PM, Jim Underwood <email@hidden> wrote:

> So, what is the best practice when reading a file in AppleScript that you did not write, and you do not have any authoritative information about its encoding?

You don't have much choice. UTF-16 isn't used much, and beyond that AS only reads UTF-8 or MacRoman (actually, I think that's MacRoman on English-language systems -- I'm not sure what it is elsewhere). So the best you can do is try UTF-8, and if you get an error, fall back to MacRoman and hope. One of the good things about UTF-8 is that any characters outside the base set common to most encodings (the 128 ASCII characters) are encoded in a particular way, as multi-byte sequences that follow a strict pattern. This means that if you try to read a file saved in another encoding as UTF-8, and it uses more than the basic Roman characters, there's almost no chance of it succeeding, so the error is a fairly reliable sign that the file isn't UTF-8.
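As a rough sketch of that try-then-fall-back approach in plain AppleScript (no ASObjC): it assumes, as described above, that the read command raises an error when the bytes are not valid UTF-8, and the variable names are only illustrative.

```applescript
set theFile to choose file
try
	-- strict attempt: expected to fail if the bytes are not valid UTF-8
	set theText to read theFile as «class utf8»
on error
	-- fall back to the primary encoding (MacRoman on English-language systems) and hope
	set theText to read theFile as text
end try
```

A pure ASCII file reads identically either way, so the fallback only matters for files containing characters outside the ASCII range.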

If you're prepared to use ASObjC, there is an NSString method that tries to guess the encoding. It was only introduced in 10.10, and I think it had some problems before 10.11. You read the file as data (NSData) and pass it to the method.

The method takes a dictionary of encoding options, which are basically hints about how you want the conversion done. For example, you can specify that you never want a lossy conversion. You can also specify the likely language if you know it. You can specify if it's likely to have come from Windows. And you can give it a list of encodings to prefer.

Here's an example with two options: don't do a lossy conversion, and it's probably an English document:

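use AppleScript version "2.5" -- OS X 10.11 or later
use scripting additions
use framework "Foundation"

-- read the file as raw data
set aPOSIXpath to POSIX path of (choose file)
set anNSData to current application's NSData's dataWithContentsOfFile:aPOSIXpath
-- options: don't allow a lossy conversion, and the text is probably English
set theOptions to current application's NSDictionary's dictionaryWithObjects:{false, "en"} forKeys:{current application's NSStringEncodingDetectionAllowLossyKey, current application's NSStringEncodingDetectionLikelyLanguageKey}
-- returns the guessed encoding, the converted string, and whether the conversion was lossy
set {theEncoding, theString, wasLossy} to current application's NSString's stringEncodingForData:anNSData encodingOptions:theOptions convertedString:(reference) usedLossyConversion:(reference)
if theEncoding is 0 then
	-- it was an error: no encoding could be determined
	error "Could not determine the file's encoding."
end if
set theText to theString as text -- convert the NSString to AppleScript text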
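If it helps to see which encoding was guessed, the same call can be wrapped in a handler that also reports the encoding's name. This is only a sketch: the handler name readGuessingEncoding and its return format are made up for illustration, and the only option passed is the no-lossy-conversion one.

```applescript
use AppleScript version "2.5"
use scripting additions
use framework "Foundation"

-- hypothetical helper: returns {text, encoding name, lossy flag}, or raises an error
on readGuessingEncoding(aPOSIXpath)
	set theData to current application's NSData's dataWithContentsOfFile:aPOSIXpath
	if theData is missing value then error "Could not read " & aPOSIXpath
	set theOptions to current application's NSDictionary's dictionaryWithObjects:{false} forKeys:{current application's NSStringEncodingDetectionAllowLossyKey}
	set {theEncoding, theString, wasLossy} to current application's NSString's stringEncodingForData:theData encodingOptions:theOptions convertedString:(reference) usedLossyConversion:(reference)
	-- 0 means no encoding could be determined
	if theEncoding is 0 then error "Could not determine the encoding of " & aPOSIXpath
	-- human-readable name of the guessed encoding
	set encodingName to (current application's NSString's localizedNameOfStringEncoding:theEncoding) as text
	return {theString as text, encodingName, wasLossy}
end readGuessingEncoding

set {theText, encodingName, wasLossy} to readGuessingEncoding(POSIX path of (choose file))
display dialog "Read as: " & encodingName
```

Checking the reported name against what you expect is a cheap sanity test, since the method is only guessing.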

If you know the file is from Windows, you could use these options instead:

set theOptions to current application's NSDictionary's dictionaryWithObjects:{false, "en", true} forKeys:{current application's NSStringEncodingDetectionAllowLossyKey, current application's NSStringEncodingDetectionLikelyLanguageKey, current application's NSStringEncodingDetectionFromWindowsKey}
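The "list of encodings to prefer" mentioned earlier can be passed in the same way, using NSStringEncodingDetectionSuggestedEncodingsKey. The following is a sketch only; the three encodings chosen are an illustrative assumption, not a recommendation.

```applescript
use AppleScript version "2.5"
use scripting additions
use framework "Foundation"

set aPOSIXpath to POSIX path of (choose file)
set anNSData to current application's NSData's dataWithContentsOfFile:aPOSIXpath
-- encodings to prefer (an illustrative choice)
set preferredEncodings to {current application's NSUTF8StringEncoding, current application's NSWindowsCP1252StringEncoding, current application's NSMacOSRomanStringEncoding}
-- options: no lossy conversion, plus the list of suggested encodings
set theOptions to current application's NSDictionary's dictionaryWithObjects:{false, preferredEncodings} forKeys:{current application's NSStringEncodingDetectionAllowLossyKey, current application's NSStringEncodingDetectionSuggestedEncodingsKey}
set {theEncoding, theString, wasLossy} to current application's NSString's stringEncodingForData:anNSData encodingOptions:theOptions convertedString:(reference) usedLossyConversion:(reference)
if theEncoding is 0 then error "Could not determine the file's encoding."
set theText to theString as text
```

Suggested encodings are hints rather than a hard restriction; there is a separate key, NSStringEncodingDetectionUseOnlySuggestedEncodingsKey, if you want to limit the guess to that list.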

-- 
Shane Stanley <email@hidden>
<www.macosxautomation.com/applescript/apps/>

