Working with an Unsupported Character Encoding (ANSEL)

Subject: Working with an Unsupported Character Encoding (ANSEL)
From: Thomas Wetmore <email@hidden>
Date: Sat, 03 Oct 2009 11:11:56 -0400

I am writing software to handle GEDCOM files. These files are usually in ASCII, though some are in ANSEL (the encoding they are supposed to use), and in recent years more and more are in Unicode encodings. A GEDCOM file is supposed to include an attribute that specifies its character set, but, as with HTML files, the attribute is not always present, and when present it is not always correct. And if the file is in Unicode, the attribute does not say which specific encoding is used.

ANSEL is an 8-bit encoding where the lower half is ASCII and the upper half includes some non-spacing diacritics as well as a few specific Latin letter and diacritic combinations. I have found no Cocoa/NSString/CFString support for ANSEL.

My current approach is to read the file using NSASCIIStringEncoding and to then determine the encoding of the file by scanning through it. I decided to do this since most files are indeed ASCII so in most cases no further I/O or character conversion is needed.

While scanning the file I look for the attribute that specifies what the file should be, but I also do other checks. For example, I check whether any of the upper-half bytes are illegal in ANSEL, and I check for UTF-8 multi-byte sequences. At the end I know whether the file is valid ASCII, not ASCII but valid ANSEL, or not ASCII or ANSEL but valid UTF-8; if it is not valid as any of those three, I assume it's UTF-16.
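The scan described above could be sketched roughly as follows. This is only an illustration in plain C, not the code from my project: the names `guess_encoding`, `is_valid_utf8`, and `GuessedEncoding` are made up for the example, and the `ansel_legal` table (one flag per upper-half byte, 0x80 through 0xFF) is assumed to be filled in by the caller from the ANSEL specification.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum { ENC_ASCII, ENC_ANSEL, ENC_UTF8, ENC_UNKNOWN } GuessedEncoding;

/* True if buf[0..len) is well-formed UTF-8: correct lead/continuation
   structure, no overlong forms, no surrogates, nothing above U+10FFFF. */
static bool is_valid_utf8(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        size_t extra;      /* number of continuation bytes expected */
        uint32_t cp, min;  /* decoded value and smallest legal value */
        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;                   /* stray continuation or bad lead */
        if (len - i <= extra) return false;  /* sequence truncated at end */
        for (size_t k = 1; k <= extra; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

/* ansel_legal[b - 0x80] says whether upper-half byte b is legal ANSEL.
   The table contents are an assumption supplied by the caller. */
GuessedEncoding guess_encoding(const uint8_t *buf, size_t len,
                               const bool ansel_legal[128]) {
    bool all_ascii = true, all_ansel = true;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] >= 0x80) {
            all_ascii = false;
            if (!ansel_legal[buf[i] - 0x80]) all_ansel = false;
        }
    }
    if (all_ascii) return ENC_ASCII;
    if (all_ansel) return ENC_ANSEL;
    if (is_valid_utf8(buf, len)) return ENC_UTF8;
    return ENC_UNKNOWN;  /* the fallback assumption in this post is UTF-16 */
}
```

The order of the tests matters: a UTF-8 file whose non-ASCII bytes all happen to be legal ANSEL bytes would be reported as ANSEL here, so the character-set attribute in the file is still worth consulting as a tie-breaker.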

If the file is UTF-8 or UTF-16 I can just reread it with the correct encoding. However, if it is ANSEL I must do some delicate fiddling to convert it to Unicode.

I am relatively new to Cocoa and NSStrings, so this has led to a few questions.

1. Apparently, reading a file into an NSString using NSASCIIStringEncoding returns each byte of the file exactly as it was, including bytes with the high bit set. So is it true that reading with NSASCIIStringEncoding doesn't mess around with any of the 8-bit bytes in the file?

2. Given that I have an NSString that I read in as NSASCIIStringEncoding, but I later determine it should have been read as UTF-8 or UTF-16, can I transform that NSString in place, or must I reread the file with the proper encoding? I don't mind doing the latter, but a conversion solution, if there is one, would perform better.

3. I'm imagining two ways to do the ANSEL-to-Unicode transformation to get the NSString.
   a. Create a C array of 16-bit shorts and convert the ANSEL to pure Unicode. Is there an API to convert such a C array of 16-bit shorts to an NSString?
   b. Create a new NSString directly by building it up character by character. Would performance suffer greatly compared with the former approach?
   c. Is there an easier approach I am not seeing?
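To make approach (a) concrete, here is a rough sketch in plain C of filling a buffer of 16-bit units from ANSEL bytes. It rests on several assumptions: the function names are invented for the example, the mapping covers only four combining bytes (0xE1 through 0xE4) and should be checked against the ANSEL tables before use, and upper-half spacing characters are simply rejected. The one structural point it tries to show is that ANSEL puts a non-spacing diacritic before its base letter, while Unicode combining marks follow the base, so the marks have to be held back and emitted after the next base character. A buffer like this is the kind of input that `+[NSString stringWithCharacters:length:]` or `CFStringCreateWithCharacters` accepts.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative and deliberately partial: maps a few ANSEL combining bytes
   to Unicode combining marks. Returns 0 for bytes this sketch does not cover. */
static uint16_t ansel_combining_to_unicode(uint8_t b) {
    switch (b) {
        case 0xE1: return 0x0300;  /* combining grave accent */
        case 0xE2: return 0x0301;  /* combining acute accent */
        case 0xE3: return 0x0302;  /* combining circumflex accent */
        case 0xE4: return 0x0303;  /* combining tilde */
        default:   return 0;
    }
}

/* Converts ANSEL bytes to UTF-16 code units. `out` must have room for `len`
   units, since every input byte produces at most one output unit.
   Returns the number of units written, or (size_t)-1 if the input contains
   a byte this sketch cannot map, too many stacked diacritics, or a
   diacritic with no base character after it. */
size_t ansel_to_utf16(const uint8_t *in, size_t len, uint16_t *out) {
    uint16_t pending[8];  /* diacritics waiting for their base character */
    size_t npending = 0, n = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = in[i];
        if (b < 0x80) {
            /* ASCII base character: emit it, then the held-back marks.
               Keeping the marks in their original order is an assumption. */
            out[n++] = b;
            for (size_t k = 0; k < npending; k++) out[n++] = pending[k];
            npending = 0;
        } else {
            uint16_t mark = ansel_combining_to_unicode(b);
            if (mark == 0 || npending == 8) return (size_t)-1;
            pending[npending++] = mark;
        }
    }
    if (npending != 0) return (size_t)-1;  /* dangling diacritic at end */
    return n;
}
```

With this sketch, the two ANSEL bytes 0xE2 'e' become the two UTF-16 units U+0065 U+0301, which is the decomposed form of "é". A full converter would also need the upper-half spacing characters and the rest of the diacritic table.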

Thanks very much for any advice.

Tom Wetmore