Re: text encodings
- Subject: Re: text encodings
- From: Dan Wood <email@hidden>
- Date: Fri, 22 Nov 2002 07:59:53 -0800
we're currently writing a text-conversion plug-in for our app, and are
quite unsure which encoding to choose. :)
we got plaintext to start with and would like to export these texts to
unix-, mac-, windows-, unicode-format. now... these are our options:
You're probably best off supporting all of the formats, and letting the
user decide what they want. Everybody is going to have different
needs; perhaps you could take the most common ones and present them
first. (If you look in the documentation and header comments, you'll
probably find the official names of these encodings.)
NSASCIIStringEncoding = 1,
This is just plain ASCII text, with no high-bit characters. If your
text contains European accent marks like "résumé" or is in any language
other than English, you don't want this unless your user needs the text
to be 7-bit ASCII.
NSNEXTSTEPStringEncoding = 2,
This is an old encoding used by NeXT computers -- probably not in much
demand now, but there anyhow.
NSJapaneseEUCStringEncoding = 3,
NSShiftJISStringEncoding = 8,
A couple of ways of encoding Japanese characters, useful if your text
might be Japanese.
NSUTF8StringEncoding = 4,
VERY useful -- this encodes any of the 7-bit ASCII characters as a
byte, and any other UNICODE characters get encoded over multiple bytes.
This allows text to be fully unicode, but look like ASCII when it's
just plain English. You can learn about this format with a bit of Web
searching.
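To make the byte counts concrete, here is a minimal sketch. It uses Python's built-in str.encode purely as a stand-in for the Cocoa conversion calls, and the helper name utf8_length is made up for illustration.

```python
# Illustration only: Python's str.encode stands in for Cocoa's conversion.
# utf8_length is a hypothetical helper name, not part of any API.

def utf8_length(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


plain = "resume"                # 7-bit ASCII only
accented = "r\u00e9sum\u00e9"   # "résumé", with two accented characters

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
assert plain.encode("utf-8") == plain.encode("ascii")

print(utf8_length(plain))     # 6: one byte per character
print(utf8_length(accented))  # 8: each accented "e" takes two bytes
```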
NSISOLatin1StringEncoding = 5,
This is a common way for European characters to get encoded as high-bit
ASCII (values 128-255), and it's an international, cross-platform
standard. Many web sites deliver their text in this format.
NSSymbolStringEncoding = 6,
This is a way for symbol characters to be encoded. Don't know many
specifics on this.
NSNonLossyASCIIStringEncoding = 7,
AFAIK, this is sort of like 1 in that the output has no high-bit
characters, but non-ASCII characters are written out as escape
sequences (such as \u00e9) instead of being lost.
NSISOLatin2StringEncoding = 9,
Another way of encoding European characters; I haven't run across this
much.
NSUnicodeStringEncoding = 10,
This encodes all characters as two bytes each (for the most part);
there are also special marker bytes (a byte-order mark) at the
beginning of the stream/file so that the program reading the text can
figure out if it was encoded in big-endian or little-endian byte
order. This will result in a smaller
stream/file size if you are expecting lots of double-byte characters,
since those take up more space in UTF8 encoding, and a larger stream
size if you're pretty much using English characters, since you'd be
using two bytes for each character that only needs one.
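A rough sketch of the marker bytes and the size trade-off, again using Python's standard codecs rather than Cocoa. The function names are hypothetical, and the sample strings are only examples.

```python
import codecs

# Illustration only: hypothetical helpers built on Python's standard codecs.

def encode_utf16_with_bom(text: str, big_endian: bool = True) -> bytes:
    """Encode text as UTF-16, prefixed with the matching byte-order mark."""
    if big_endian:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def detect_byte_order(data: bytes) -> str:
    """Inspect the first two bytes to see which byte order was used."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown"


english = "hello"
japanese = "\u65e5\u672c\u8a9e"  # three Japanese characters

# Byte counts as (UTF-8, UTF-16 including the 2-byte marker).
print(len(english.encode("utf-8")), len(encode_utf16_with_bom(english)))    # 5 12
print(len(japanese.encode("utf-8")), len(encode_utf16_with_bom(japanese)))  # 9 8
print(detect_byte_order(encode_utf16_with_bom(english, big_endian=False)))  # little-endian
```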
NSWindowsCP1251StringEncoding = 11,
NSWindowsCP1252StringEncoding = 12,
NSWindowsCP1253StringEncoding = 13,
NSWindowsCP1254StringEncoding = 14,
NSWindowsCP1250StringEncoding = 15,
Various Windows-specific encodings. 12 is the most common for
English/European text that I've seen; most web sites that are hosted on
a Windows machine tend to deliver their content in that format.
NSISO2022JPStringEncoding = 21,
Not sure off the top of my head, I think this is another Japanese
encoding.
NSMacOSRomanStringEncoding = 30,
This is the default encoding for the Mac, to hold English/European
characters. On the Mac, if you open a file and there is no way to
guess the encoding, this is the encoding it will try.
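The reason the choice among the single-byte encodings matters is that the same character can land on a different byte value in each one. A small sketch, using Python's codec names (latin-1, cp1252, mac_roman) as stand-ins for the Cocoa constants; byte_for is a made-up helper.

```python
from typing import Optional

# Illustration only: Python codec names stand in for the NS...StringEncoding
# constants, and byte_for is a hypothetical helper.

def byte_for(char: str, encoding: str) -> Optional[int]:
    """Return the single byte value for char, or None if it cannot be encoded."""
    try:
        encoded = char.encode(encoding)
    except UnicodeEncodeError:
        return None
    return encoded[0]


E_ACUTE = "\u00e9"  # the accented "e" in "résumé"

for name in ("ascii", "latin-1", "cp1252", "mac_roman"):
    value = byte_for(E_ACUTE, name)
    print(name, "cannot encode" if value is None else hex(value))
```

The accented "e" comes out as 0xE9 in both ISO Latin 1 and Windows CP1252 but as 0x8E in Mac OS Roman, and plain ASCII cannot represent it at all. Reading a file with the wrong one of these silently produces the wrong characters.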
NSProprietaryStringEncoding = 65536
This would be used if you had some other encoding.... haven't run
across this in practical use.
----snip----
some seem obvious, but some just don't. :)
anybody care to shed some light on this?
or are we heading in a completely wrong direction?
I think you're probably doing the right thing. Take a look at how
TextEdit works for opening files and saving files, that should give you
an idea of what makes sense. If you are reading text that comes from
an arbitrary source, you need to support as many encoding formats as
possible. Internally, the text will be stored as unicode. Then, if
you are writing it out, and the user might want to write it out in a
different format, you should give them plenty of options.
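The read, hold as Unicode, write pattern can be sketched like this. It is a Python analogue, not the Cocoa API; the function name transcode and the sample bytes are assumptions for illustration.

```python
# Illustration only: a Python analogue of decoding into a Unicode string and
# re-encoding on output. transcode is a hypothetical name.

def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode bytes into Unicode text, then encode that text in the target."""
    text = data.decode(source_encoding)   # internal form is Unicode
    return text.encode(target_encoding)   # raises UnicodeEncodeError if lossy


# "résumé" as it would be stored in a Mac OS Roman file.
mac_bytes = b"r\x8esum\x8e"

print(transcode(mac_bytes, "mac_roman", "utf-8"))   # b'r\xc3\xa9sum\xc3\xa9'
print(transcode(mac_bytes, "mac_roman", "cp1252"))  # b'r\xe9sum\xe9'
```

Letting the encode step raise an error, rather than silently dropping characters, is one possible design choice; an app might instead warn the user that the chosen format cannot hold some of the text.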
Also, be sure to handle Unix, Mac, and Windows end-of-line
encodings.... \n, \r, and \r\n respectively. Somebody opening a file
in Mac format and saving it in Windows format is going to want to have
your program do the right thing....
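One way to handle the end-of-line conversion, as a sketch: normalize everything to \n first, then expand to the target convention. The names LINE_ENDINGS and convert_line_endings are made up for this example.

```python
# Illustration only: hypothetical names, operating on an already-decoded string.

LINE_ENDINGS = {
    "unix": "\n",
    "mac": "\r",        # classic Mac convention, as described above
    "windows": "\r\n",
}


def convert_line_endings(text: str, target: str) -> str:
    """Convert any mix of \\n, \\r, and \\r\\n line endings to the target style."""
    # Replace \r\n first so a Windows ending is not counted as two line breaks.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", LINE_ENDINGS[target])


mixed = "one\rtwo\r\nthree\n"
print(repr(convert_line_endings(mixed, "windows")))  # 'one\r\ntwo\r\nthree\r\n'
print(repr(convert_line_endings(mixed, "unix")))     # 'one\ntwo\nthree\n'
```

Doing the conversion on the Unicode string, before encoding to bytes, avoids having to know how each encoding represents \r and \n.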
--
Dan Wood
Karelia Software, LLC
email@hidden
http://www.karelia.com/
Watson for Mac OS X:
http://www.karelia.com/watson/