Re: Working with an Unsupported Character Encoding (ANSEL)
- Subject: Re: Working with an Unsupported Character Encoding (ANSEL)
- From: "Adam R. Maxwell" <email@hidden>
- Date: Sat, 03 Oct 2009 20:59:50 -0700
On Oct 3, 2009, at 8:11 AM, Thomas Wetmore wrote:
While scanning the file I look for the attribute that specifies what
the file should be, but I also do other checks. For example I check
whether any of the upper half bytes are illegal ANSEL. And I check
for UTF-8 multi-byte encodings. At the end I know whether the file
is valid ASCII, not ASCII but valid ANSEL, or neither ASCII nor ANSEL
but valid UTF-8; if it's not valid as any of those three, I
assume it's UTF-16.
Can you check for a Unicode BOM for UTF-16, too? Regardless, instead
of reading the file as an NSString, I'd recommend reading it into
NSData, particularly if you want to look at raw bytes. NSString is
not a generic byte container, and you can run into problems if you
specify an incorrect encoding.
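As a rough illustration of checking raw bytes rather than an NSString, here is a minimal C sketch that looks for a UTF-16 BOM, then for pure 7-bit ASCII, then for well-formed UTF-8. The names guess_encoding, has_utf16_bom, is_valid_utf8 and the enc_guess enum are hypothetical, not part of any Apple API. The ANSEL legality check from the original message is deliberately left out, so anything that is neither ASCII nor UTF-8 is simply reported as ENC_OTHER. In Cocoa, the buffer and length would presumably come from NSData's bytes and length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
typedef enum { ENC_ASCII, ENC_UTF8, ENC_UTF16_BOM, ENC_OTHER } enc_guess;

/* Returns 1 if buf starts with a UTF-16 byte order mark (FE FF or FF FE). */
int has_utf16_bom(const uint8_t *buf, size_t n)
{
    return n >= 2 && ((buf[0] == 0xFE && buf[1] == 0xFF) ||
                      (buf[0] == 0xFF && buf[1] == 0xFE));
}

/* Returns 1 if buf is well-formed UTF-8 (rejects overlong forms,
   surrogates, and code points above U+10FFFF). */
int is_valid_utf8(const uint8_t *buf, size_t n)
{
    size_t i = 0;
    while (i < n) {
        uint8_t b = buf[i];
        size_t extra;                  /* number of continuation bytes */
        uint8_t lo = 0x80, hi = 0xBF;  /* allowed range of the first one */

        if (b < 0x80) { i++; continue; }
        if (b >= 0xC2 && b <= 0xDF) {
            extra = 1;
        } else if (b >= 0xE0 && b <= 0xEF) {
            extra = 2;
            if (b == 0xE0) lo = 0xA0;  /* no overlong 3-byte forms */
            if (b == 0xED) hi = 0x9F;  /* no UTF-16 surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) {
            extra = 3;
            if (b == 0xF0) lo = 0x90;  /* no overlong 4-byte forms */
            if (b == 0xF4) hi = 0x8F;  /* nothing above U+10FFFF */
        } else {
            return 0;                  /* stray continuation or invalid lead */
        }
        if (n - i <= extra) return 0;  /* sequence truncated at end of buffer */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return 0;
        for (size_t k = 2; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}

/* Classifies a raw byte buffer; ANSEL detection is out of scope here. */
enc_guess guess_encoding(const uint8_t *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    if (has_utf16_bom(buf, n)) return ENC_UTF16_BOM;

    int has_high_byte = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] >= 0x80) { has_high_byte = 1; break; }
    if (!has_high_byte) return ENC_ASCII;

    return is_valid_utf8(buf, n) ? ENC_UTF8 : ENC_OTHER;
}
```

A file that comes back as ENC_OTHER would then go through the ANSEL checks before falling back to a UTF-16 guess.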
If the file is UTF-8 or UTF-16 I can just reread it with the correct
encoding. However, if it is ANSEL I must do some delicate fiddling
to convert it to Unicode.
I am relatively new to Cocoa and NSStrings, so this has led to a
few questions.
1. Apparently reading a file to an NSString using the
NSASCIIStringEncoding returns each of the bytes of the file exactly
as they were, that is, the 8-bit bytes seem to be read exactly as
they were. So is it true that reading with NSASCIIStringEncoding
doesn't mess around with any of the 8-bit bytes in the file?
I don't know if you can rely on this, since ASCII is a 7-bit encoding
and bytes above 0x7F fall outside it; NSData is safer, as I mentioned
above.
2. Given I have an NSString that I read in as NSASCIIStringEncoding
but I later determine it should have been read as UTF8 or UTF16, can
I transform that NSString in place, or must I reread the file with
the proper encoding? I don't mind doing the latter, but if there is
a conversion solution it would have better performance.
No, you'd need to reread it. However, if you read it as NSData, you
can create the string using initWithData:encoding:.
3. I'm imagining two ways to do the ANSEL to UNICODE transformation
to get the NSString.
a. Create a C-array of 16-bit shorts and convert the ANSEL to pure
UNICODE. Is there an API to convert such a C-array of 16-bit
shorts to an NSString?
NSString's initWithCharacters:length: will read a C array of unichar
values (UTF-16 code units).
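A minimal C sketch of the C-array approach follows, under several assumptions. ANSEL places combining diacritics before the base letter, whereas Unicode places them after, so the conversion has to reorder them. The function names (ansel_to_utf16, map_combining, map_spacing) are hypothetical, the 0xE0-0xFE combining range and the handful of table entries are from memory of the ANSEL/MARC-8 tables and should be checked against a full mapping, and every unmapped byte becomes U+FFFD. The output array must have room for at least n units; the result could then be handed to initWithCharacters:length:.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tiny, illustrative subset of the ANSEL combining marks (assumed values). */
uint16_t map_combining(uint8_t b)
{
    switch (b) {
    case 0xE1: return 0x0300;  /* combining grave accent */
    case 0xE2: return 0x0301;  /* combining acute accent */
    case 0xE3: return 0x0302;  /* combining circumflex */
    case 0xE4: return 0x0303;  /* combining tilde */
    case 0xE8: return 0x0308;  /* combining diaeresis */
    default:   return 0xFFFD;  /* not in this sketch's table */
    }
}

/* Tiny, illustrative subset of the ANSEL spacing characters (assumed values). */
uint16_t map_spacing(uint8_t b)
{
    switch (b) {
    case 0xA2: return 0x00D8;  /* O with stroke */
    case 0xA5: return 0x00C6;  /* AE ligature */
    case 0xB2: return 0x00F8;  /* o with stroke */
    case 0xB5: return 0x00E6;  /* ae ligature */
    default:   return 0xFFFD;  /* not in this sketch's table */
    }
}

/* Converts n ANSEL bytes to UTF-16 code units, moving each run of
   combining marks from before its base character to after it.
   out must hold at least n units. Returns the number of units written. */
size_t ansel_to_utf16(const uint8_t *in, size_t n, uint16_t *out)
{
    assert(n == 0 || (in != NULL && out != NULL));
    size_t o = 0;          /* next free output index */
    size_t base_slot = 0;  /* slot reserved for the base of pending marks */
    int pending = 0;       /* nonzero while marks are waiting for a base */

    for (size_t i = 0; i < n; i++) {
        uint8_t b = in[i];
        if (b >= 0xE0 && b <= 0xFE) {
            /* Combining mark: reserve a slot for the base that follows. */
            if (!pending) { base_slot = o++; pending = 1; }
            out[o++] = map_combining(b);
        } else {
            uint16_t u = (b < 0x80) ? (uint16_t)b : map_spacing(b);
            if (pending) { out[base_slot] = u; pending = 0; }
            else         { out[o++] = u; }
        }
    }
    if (pending) {
        /* Marks at end of input with no base: close the reserved gap. */
        memmove(&out[base_slot], &out[base_slot + 1],
                (o - base_slot - 1) * sizeof out[0]);
        o--;
    }
    return o;
}
```

For example, the two ANSEL bytes 0xE2 'e' would come out as U+0065 followed by U+0301. This sketch does not compose the pair into a precomposed character; that would be a separate normalization step.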
b. Create a new NSString directly by building it up character by
character. Would performance suffer greatly over the former approach?
I'd avoid that, unless you're dealing with small strings. If your
conversion operates at the character level, stick with C arrays or
NSMutableData; if you need to combine a unichar buffer with the
convenience of NSMutableString, you can use
CFStringCreateMutableWithExternalCharactersNoCopy, but that can be
tricky.
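When the output length is not known in advance, one way to stay with C arrays is a growable buffer of 16-bit units that doubles its capacity as needed, so appending stays cheap on average. The sketch below is only an illustration: u16buf, u16buf_push and u16buf_free are hypothetical names, and the initial capacity of 64 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable array of UTF-16 code units. */
typedef struct {
    uint16_t *data;  /* heap storage, NULL while empty */
    size_t len;      /* units in use */
    size_t cap;      /* units allocated */
} u16buf;

/* Appends one code unit, doubling the capacity when full.
   Returns 0 on success, -1 if allocation fails (buffer left unchanged). */
int u16buf_push(u16buf *b, uint16_t unit)
{
    assert(b != NULL);
    if (b->len == b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 64;
        uint16_t *p = realloc(b->data, new_cap * sizeof *p);
        if (p == NULL) return -1;
        b->data = p;
        b->cap = new_cap;
    }
    b->data[b->len++] = unit;
    return 0;
}

/* Releases the storage and resets the buffer to the empty state. */
void u16buf_free(u16buf *b)
{
    assert(b != NULL);
    free(b->data);
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}
```

A buffer starts out zero-initialized (u16buf b = {0};), and once the conversion finishes, data and len are the pointer and count that a call such as initWithCharacters:length: would need.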
c. Is there an easier approach I am not seeing?
I noticed that CFStringEncoding lists kCFStringEncodingANSEL as an
external encoding, but CFStringIsEncodingAvailable returns false,
unfortunately. You could probably write a plugin for the Text
Encoding Converter, but I've never tried that myself.
http://developer.apple.com/mac/library/documentation/Carbon/Conceptual/ProgWithTECM/tecmgr_about/tecmgr_about.html
_______________________________________________
Cocoa-dev mailing list (email@hidden)