Working with an Unsupported Character Encoding (ANSEL)
- Subject: Working with an Unsupported Character Encoding (ANSEL)
- From: Thomas Wetmore <email@hidden>
- Date: Sat, 03 Oct 2009 11:11:56 -0400
I am writing software to handle GEDCOM files. These files are usually
in ASCII format, though some are in ANSEL format (the format they are
supposed to be in), and in recent years more and more are in UNICODE
encodings. A GEDCOM file is supposed to include an attribute that
specifies its character set, but, as in HTML files, the attribute is
not always there, or if there, it is not always correct. And if the
file is in UNICODE, the attribute does not say which Unicode encoding
is used.
ANSEL is an 8-bit encoding where the lower half is ASCII and the upper
half includes some non-spacing diacritics as well as a few specific
Latin letter and diacritic combinations. There is no Cocoa/NSString/
CFString support for ANSEL that I have found.
My current approach is to read the file using NSASCIIStringEncoding
and to then determine the encoding of the file by scanning through it.
I decided to do this since most files are indeed ASCII so in most
cases no further I/O or character conversion is needed.
While scanning the file I look for the attribute that specifies what
the file should be, but I also do other checks. For example I check
whether any of the upper half bytes are illegal ANSEL. And I check for
UTF-8 multi-byte encodings. At the end I know whether the file is
valid ASCII; not ASCII but valid ANSEL; or neither ASCII nor ANSEL but
valid UTF-8. If it's not valid as any of those three I assume it's
UTF-16.
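
A minimal sketch of the scan described above, in plain C on the raw bytes. The names Encoding, is_valid_utf8 and classify_bytes are hypothetical, and the table of legal ANSEL upper-half bytes is assumed to be filled in by the caller from the ANSEL specification; it is not reproduced here. The order of the checks follows the description above (ASCII, then ANSEL, then UTF-8).

```c
#include <stddef.h>

/* Hypothetical result type; the names are illustrative only. */
typedef enum { ENC_ASCII, ENC_ANSEL, ENC_UTF8, ENC_UNKNOWN } Encoding;

/* Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;                        /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of 2nd byte   */

        if (b < 0x80) { i++; continue; }     /* plain ASCII byte */
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) {
            extra = 2;
            if (b == 0xE0) lo = 0xA0;        /* reject overlong forms      */
            if (b == 0xED) hi = 0x9F;        /* reject UTF-16 surrogates   */
        } else if (b >= 0xF0 && b <= 0xF4) {
            extra = 3;
            if (b == 0xF0) lo = 0x90;        /* reject overlong forms      */
            if (b == 0xF4) hi = 0x8F;        /* stay at or below U+10FFFF  */
        } else return 0;                     /* stray continuation or bad lead */

        if (len - i <= extra) return 0;      /* sequence truncated at end */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return 0;
        for (size_t k = 2; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}

/* ansel_legal[b] != 0 means byte b is a defined ANSEL code.
   ASSUMPTION: the caller fills this table from the ANSEL standard. */
Encoding classify_bytes(const unsigned char *buf, size_t len,
                        const unsigned char ansel_legal[256])
{
    int saw_high = 0, ansel_ok = 1;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x80) {
            saw_high = 1;
            if (!ansel_legal[buf[i]]) ansel_ok = 0;
        }
    }
    if (!saw_high) return ENC_ASCII;
    if (ansel_ok)  return ENC_ANSEL;
    if (is_valid_utf8(buf, len)) return ENC_UTF8;
    return ENC_UNKNOWN;   /* the case treated as UTF-16 above */
}
```

One consequence of this ordering is that a UTF-8 file whose bytes all happen to be legal ANSEL codes would be reported as ANSEL, so the declared character-set attribute is still worth consulting as a tie-breaker.
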
If the file is UTF-8 or UTF-16 I can just reread it with the correct
encoding. However, if it is ANSEL I must do some delicate fiddling to
convert it to Unicode.
I am relatively new to Cocoa and NSStrings, so this has led to a few
questions.
1. Apparently reading a file to an NSString using
NSASCIIStringEncoding returns each byte of the file unchanged,
including bytes with the high bit set. So is it true that reading with
NSASCIIStringEncoding doesn't alter any of the 8-bit bytes in the file?
2. Given I have an NSString that I read in as NSASCIIStringEncoding
but I later determine it should have been read as UTF-8 or UTF-16, can
I transform that NSString in place, or must I reread the file with the
proper encoding? I don't mind doing the latter, but if there is a
conversion solution it would have better performance.
3. I'm imagining two ways to do the ANSEL to UNICODE transformation to
get the NSString.
a. Create a C-array of 16-bit shorts and convert the ANSEL to pure
UNICODE. Is there an API to convert such a C-array of 16-bit shorts
to an NSString?
b. Create a new NSString directly by building it up character by
character. Would performance suffer greatly over the former approach?
c. Is there an easier approach I am not seeing?
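
To make option (a) concrete, here is a rough sketch of the ANSEL to 16-bit conversion in plain C. It assumes that an ANSEL non-spacing diacritic byte comes before the letter it modifies, whereas a Unicode combining mark follows its base character, so the marks have to be held back and emitted after the next base. The names ansel_combining and ansel_to_utf16 are hypothetical, the mapping table is deliberately partial and should be checked against the ANSEL specification, and any other upper-half byte is simply replaced with U+FFFD here.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 4   /* arbitrary cap on stacked diacritics (assumption) */

/* Partial, illustrative mapping of ANSEL combining bytes to Unicode
   combining marks. Returns 0 for bytes not in this small table. */
static uint16_t ansel_combining(unsigned char b)
{
    switch (b) {
    case 0xE1: return 0x0300;  /* combining grave accent */
    case 0xE2: return 0x0301;  /* combining acute accent */
    case 0xE3: return 0x0302;  /* combining circumflex   */
    case 0xE4: return 0x0303;  /* combining tilde        */
    case 0xE8: return 0x0308;  /* combining diaeresis    */
    default:   return 0;
    }
}

/* Converts n ANSEL bytes to UTF-16 code units in out, which must have
   room for n units. Returns the number of units written. */
size_t ansel_to_utf16(const unsigned char *in, size_t n, uint16_t *out)
{
    uint16_t pending[MAX_PENDING];
    size_t npending = 0, o = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i];
        uint16_t mark = ansel_combining(b);

        if (mark != 0) {
            /* Hold the mark until its base letter arrives;
               marks beyond the cap are dropped in this sketch. */
            if (npending < MAX_PENDING) pending[npending++] = mark;
            continue;
        }

        /* Base character: ASCII passes through, anything else in the
           upper half is not handled here and becomes U+FFFD. */
        out[o++] = (b < 0x80) ? (uint16_t)b : (uint16_t)0xFFFD;

        /* Emit the held marks after the base, in their original order. */
        for (size_t k = 0; k < npending; k++) out[o++] = pending[k];
        npending = 0;
    }

    /* Marks with no following base are emitted as they are. */
    for (size_t k = 0; k < npending; k++) out[o++] = pending[k];
    return o;
}
```

For example, the ANSEL bytes 'J' 'o' 's' 0xE2 'e' would come out as the units for J, o, s, e followed by U+0301. The resulting array of 16-bit units is what option (a) would then hand to whichever NSString or CFString constructor accepts UTF-16 code units.
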
Thanks very much for any advice.
Tom Wetmore