Re: converting text input in any encoding to unicode
- Subject: Re: converting text input in any encoding to unicode
- From: "Clark S. Cox III" <email@hidden>
- Date: Mon, 28 Apr 2003 08:39:47 -0400
On Sunday, Apr 27, 2003, at 10:17 US/Eastern, Ben Dougall wrote:
On Sunday, April 27, 2003, at 01:57 pm, Clark Cox III wrote:
On Sunday, April 27, 2003, at 07:32AM, Ben Dougall
<email@hidden> wrote:
what's the best / usual way from a cocoa app to read in text that's
potentially encoded with any encoding, in order to store it internally
in your app in decomposed unicode? i'd like to be able to deal with as
many encodings as possible - and convert them to the base decomposed
unicode format in order to compare different texts confidently.
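To make the "decomposed unicode" goal concrete, here is a minimal sketch in Python rather than Cocoa, assuming the text has already been decoded into a Unicode string. The helper name `to_decomposed` is made up for illustration. It shows why normalizing to a single form (here NFD, canonical decomposition) lets two differently composed strings compare equal.

```python
import unicodedata


def to_decomposed(text: str) -> str:
    """Return text in Unicode Normalization Form D (canonical decomposition).

    Hypothetical helper name, used only for this illustration.
    """
    return unicodedata.normalize("NFD", text)


precomposed = "caf\u00e9"   # 'é' stored as one precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The raw strings differ code point by code point...
assert precomposed != decomposed
# ...but compare equal once both are in the same normalization form.
assert to_decomposed(precomposed) == to_decomposed(decomposed)
```

Normalization only helps after decoding; it says nothing about which encoding the original bytes were in.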
In order to do that, you'd need to have some idea of what encoding
the text is in. You can try to discern some encodings, but others
will be impossible to differentiate just from looking at the text
itself.
surely most (all?) text files indicate not only which characters they
contain but which encoding they're in? i'd have thought that was a
standard requirement for text?
No, almost no text files have encoding information in them.
You can usually identify Unicode text via the BOM, and you can be
pretty sure that if the text does not contain any bytes that are
greater than 127, then it can be interpreted as ASCII. Other than
that, you'd need some other hint as to the text's encoding.
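The two checks described above (look for a BOM, otherwise see whether every byte is below 128) can be sketched roughly as follows. This is an illustration in Python, not Cocoa code; `sniff_encoding` is a hypothetical name, and returning `None` stands for "some other hint is needed".

```python
import codecs

# Check the longer BOMs first: the UTF-32 LE BOM starts with the same
# two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data: bytes):
    """Guess an encoding from the bytes alone, or return None if unsure.

    Hypothetical helper; it covers only the two cases named in the text.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            # These Python codecs consume the BOM while decoding.
            return name
    if all(byte < 128 for byte in data):
        # No byte above 127, so the data can be read as ASCII.
        return "ascii"
    # Bytes above 127 and no BOM: the bytes alone do not settle it.
    return None
```

For example, `sniff_encoding(b"hello")` gives `"ascii"`, while a lone byte such as `b"\xe9"` gives `None`, since it could belong to any number of 8-bit encodings.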
unicode is one char encoding out of goodness knows how many. i guess
different text systems have different methods for indicating which
char encoding? html and xml indicate within the text itself which
encoding it's in. i'd have thought all other text formats also
indicate which encoding they're in, in one way or another - i guess 'in
one way or another' is a stumbling block maybe. but there must be an
already existing method to do that to a reasonable extent?
No, there is no method to reliably find the encoding of a raw text
file. Like I said before, generally, the only encodings that can be
readily identified are ASCII and Unicode (and only with a Byte Order
Mark).
there's CFString stuff - is that the usual way to go about doing this?
NSString/CFString (they really are the same thing) can only help you
if you have a good idea of the encoding. For example, in a project that
I'm working on, I'm porting a NextStep application to OS X. So, the
routine that I wrote to read in strings from data files goes through
the following algorithm:
Attempt to interpret the string as UTF-8 unicode
If that fails, interpret it as NextStep encoding
But this only works because we know that the NextStep version always
wrote out NextStep encoding, and the mac version always writes out
UTF-8.
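The two-step algorithm above can be sketched like this. It is written in Python for illustration, not as the actual Cocoa routine. Python ships no NextStep codec, so `latin-1` is used purely as a stand-in for the fallback encoding, and `decode_with_fallback` is a hypothetical name.

```python
def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> str:
    """Try strict UTF-8 first, then fall back to a known legacy encoding.

    The default fallback "latin-1" is an assumption standing in for the
    NextStep encoding, which Python's standard codecs do not include.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the one legacy encoding we expect.
        return data.decode(fallback)
```

The order matters: UTF-8 validation rejects most byte sequences produced by 8-bit encodings, whereas a single-byte encoding accepts nearly anything, so trying the legacy encoding first would almost never fail and the UTF-8 branch would never run.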
--
http://homepage.mac.com/clarkcox3/
email@hidden
Clark S. Cox, III