Re: String Encoding Detection (Revisited)

Subject: Re: String Encoding Detection (Revisited)
From: Francisco Tolmasky <email@hidden>
Date: Thu, 7 Aug 2003 15:33:24 -0700

I have been looking at the sniffer stuff and have a question: is TextEncoding the same thing as NSStringEncoding? I don't mean whether they are the same data type, like unsigned, but whether a value in TextEncoding means the same encoding as the same number used as an NSStringEncoding. For example, if 0x029393 is Mac OS Roman in TextEncoding, is it also 0x029393 as an NSStringEncoding?

On Thursday, August 7, 2003, at 12:00 PM, Dustin Voss wrote:

On Thursday, August 7, 2003, at 10:11 AM, Dustin Voss wrote:

On Thursday, August 7, 2003, at 01:44 AM, Francisco Tolmasky wrote:

OK, so I recently posted a question about auto-detecting string encodings, and also looked through the archives. Basically there's no way unless the text is Unicode and has a BOM. I still want an auto-detect feature though, like BBEdit's. So how do I check for a BOM? (I checked TextEdit's code and couldn't find it, though I found lots of other stuff.) Other than that, and a spell-checking trick someone suggested (using the spell checker to see whether the string makes sense, which would be pretty useless for code or anything other than plain sentences), are there any other "tricks"?

And when all else fails and I resort to just picking an encoding, which one should I choose: Mac OS Roman, ASCII, or UTF-8?

I don't know about tricks, but the BOM will be one of the following:
UTF-16 BE: FE FF
UTF-16 LE: FF FE
UTF-8: EF BB BF
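
As a rough illustration of the BOM check, here is a minimal Python sketch. The function name `sniff_bom` and the table `BOMS` are made up for this example; the byte sequences are the three listed above, taken from the standard `codecs` module constants.

```python
import codecs

# The three BOM signatures listed above, paired with Python codec names.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM,
    otherwise (None, 0). Hypothetical helper, not part of any library."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

If a BOM is found, the remaining bytes (`data[bom_length:]`) can be decoded with the returned encoding; if not, the caller still has to fall back to some other guess.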

Additionally, valid UTF-8 follows these rules (http://www.faqs.org/rfcs/rfc2279.html):
1. Any byte C0-FD will always be followed by at least one byte 80-BF.
2. Any byte 00-7F will never be followed by a byte 80-BF.
3. There will never be a byte FE or a byte FF.
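
The three UTF-8 rules above could be checked with something like the following Python sketch. It applies only those rules, so it is a heuristic rather than a full validator (it does not count continuation bytes or reject overlong forms). The name `looks_like_utf8` is an assumption made for this example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic check using only the three rules listed above."""
    prev = None
    for b in data:
        is_continuation = 0x80 <= b <= 0xBF
        # Rule 3: FE and FF never appear.
        if b in (0xFE, 0xFF):
            return False
        # Rule 2: a byte 00-7F is never followed by a byte 80-BF
        # (and a continuation byte cannot be the very first byte).
        if is_continuation and (prev is None or prev <= 0x7F):
            return False
        # Rule 1: a byte C0-FD is always followed by a byte 80-BF.
        if prev is not None and 0xC0 <= prev <= 0xFD and not is_continuation:
            return False
        prev = b
    # Rule 1 also fails if the data ends on a lead byte.
    return not (prev is not None and 0xC0 <= prev <= 0xFD)
```

Text in a single-byte encoding such as Mac OS Roman usually fails quickly, because an accented character is a lone high byte followed by an ASCII byte. For a stricter check in Python, `data.decode("utf-8")` raises `UnicodeDecodeError` on invalid input.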

Valid UTF-16 follows these rules (http://www.ietf.org/rfc/rfc2781.txt) once endian-ness is taken care of:
1. A word D800-DBFF will always be followed by one word DC00-DFFF.
2. A word DC00-DFFF will only follow a word D800-DBFF.
3. Words FFFF and FFFE are invalid.
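
The UTF-16 rules can be sketched the same way. This Python example assumes the caller already knows the byte order (for instance from the BOM) and passes it in; `looks_like_utf16` and its `big_endian` parameter are names invented for this illustration.

```python
import struct

def looks_like_utf16(data: bytes, big_endian: bool = True) -> bool:
    """Heuristic check using only the three rules listed above."""
    if len(data) % 2:
        return False  # UTF-16 is made of whole 16-bit words
    fmt = (">" if big_endian else "<") + "H" * (len(data) // 2)
    words = struct.unpack(fmt, data)
    expect_low = False  # True right after a D800-DBFF word
    for w in words:
        # Rule 3: FFFF and FFFE are invalid.
        if w in (0xFFFE, 0xFFFF):
            return False
        is_high = 0xD800 <= w <= 0xDBFF
        is_low = 0xDC00 <= w <= 0xDFFF
        # Rules 1 and 2: a DC00-DFFF word appears exactly when the
        # previous word was D800-DBFF, and never otherwise.
        if is_low != expect_low:
            return False
        expect_low = is_high
    # Rule 1 also fails if the data ends on a D800-DBFF word.
    return not expect_low
```

One side effect of rule 3: a BOM read with the wrong byte order shows up as the word FFFE, so the check fails and the caller can retry with the other endianness.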

There is some open-source Unicode validation code at http://oss.software.ibm.com/icu/. It is more thorough than the above, since I think it covers unassigned characters and invalid surrogate pairs as well.

I now return you to your regularly scheduled programming. (Heh. "Programming.")

Francisco Tolmasky
email@hidden
http://users.adelphia.net/~ftolmasky
