Re: How to detect string encoding before reading a file in NSString?
- Subject: Re: How to detect string encoding before reading a file in NSString?
- From: Andrew Thompson <email@hidden>
- Date: Tue, 26 Apr 2011 16:26:39 -0400
Another battle-tested piece of code would be Mozilla's sniffer, if external libraries and its license suit you.
This document is out of date, but explains the ideas.
http://www.mozilla.org/projects/intl/detectorsrc.html
On Apr 26, 2011, at 3:39 PM, John Pannell <email@hidden> wrote:
> Hi Laurent-
>
> I have an app that collects a lot of text off the web; my string creation algorithm is something like the following:
>
> 1. Attempt to create an NSString with NSUTF8StringEncoding.
> 2. If the string is nil, attempt to create the string using the encoding returned from the server.
> 3. If the string is still nil, ask the Text Encoding Conversion Manager to sniff out the encoding from the data.
> 3a. This returns an array of likely encodings. For each item in the array:
> 3b. Attempt to create a string with the encoding.
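
The steps above amount to a strict-decode fallback chain. Here is a minimal, language-neutral sketch of that idea in Python rather than Cocoa. It is not John's code: the function name decode_with_fallbacks and its parameters are made up for illustration, and the list of sniffed encodings is assumed to come from whatever detector you use (the Text Encoding Conversion Manager, Mozilla's sniffer, etc.).

```python
def decode_with_fallbacks(data, server_encoding=None, sniffed_encodings=()):
    """Return (text, encoding_name) for the first candidate that decodes
    `data` without errors, or (None, None) if every candidate fails.

    Hypothetical helper: `server_encoding` stands in for the charset the
    server reported, `sniffed_encodings` for a detector's ordered guesses.
    """
    # Step 1: always try UTF-8 first; strict decoding rejects most
    # non-UTF-8 input.
    candidates = ["utf-8"]
    # Step 2: then the encoding the server claimed, if any.
    if server_encoding:
        candidates.append(server_encoding)
    # Step 3: finally each guess from the sniffer, in order of likelihood.
    candidates.extend(sniffed_encodings)

    for name in candidates:
        try:
            # bytes.decode is strict by default: invalid input raises.
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this encoding, or an unknown encoding name.
            continue
    return None, None


if __name__ == "__main__":
    print(decode_with_fallbacks(b"caf\xe9", server_encoding="latin-1"))
```

One caveat with this ordering: single-byte encodings such as Latin-1 accept any byte sequence, so once one of them is tried the chain always stops there, whether or not the guess was right. That is why UTF-8 goes first and the permissive encodings go last.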
>
> There was a little too much code associated with this to copy/paste into email, but I'd be happy to share... I have a wrapper object for the needed interaction with the Text Encoding Conversion Manager. Some more about it:
>
> http://developer.apple.com/library/mac/#documentation/Carbon/Reference/Text_Encodin_sion_Manager/Reference/reference.html#//apple_ref/doc/uid/TP30000123
>
> Hope this helps!
>
> John
>
>
> John Pannell
> http://www.positivespinmedia.com
>
> On Apr 26, 2011, at 12:53 PM, Nick Zitzmann wrote:
>
>>
>> On Apr 26, 2011, at 12:49 PM, Laurent Daudelin wrote:
>>
>>>> TextEdit's encoding guesser just uses the built-in NSAttributedString method -initWithURL:options:documentAttributes:error:, which will guess the file's encoding when opening it. But it has been mentioned that heuristics are not infallible, and this method's heuristics are no exception. It does a good job overall, but I've found that it usually misinterprets UTF-8 format text.
>>>
>>> Yes, I know that all the guess jobs can fail. I was starting to get excited when I started reading your reply, but if it usually misinterprets UTF-8, that's a pretty significant problem...
>>
>> That was a long time ago, so it may have been fixed. But if it's still happening, then one workaround would be to try and open the file as UTF-8 first, and if that fails, then fall back on the above method. The UTF-8 parser often returns nil on text that is not in UTF-8 format IIRC.
>>
>
> _______________________________________________
>
> Cocoa-dev mailing list (email@hidden)
>
> Please do not post admin requests or moderator comments to the list.
> Contact the moderators at cocoa-dev-admins(at)lists.apple.com
>
> Help/Unsubscribe/Update your Subscription:
>
> This email sent to email@hidden