Re: converting text input in any encoding to unicode

  • Subject: Re: converting text input in any encoding to unicode
  • From: arekkusu <email@hidden>
  • Date: Sun, 27 Apr 2003 14:29:08 -0700

On Sunday, April 27, 2003, at 01:31 PM, email@hidden wrote:
> i'm quite surprised on that. i'd have thought indicating which encoding
> was a necessity for it to be readable. i guess it's a case of text
> files moving from system to system - before that was so prevalent and
> expected, it was common and reasonable for a system to assume the text
> will be in its own format, whatever that might have been, and that
> kind of thing has hung around for a bit too long, maybe.

A very long time. All languages are more or less arbitrary.

Will caveman X's wall drawings make sense to caveman Y? How about English Morse code vs Japanese Morse code?

People generally just invent whatever they need at the time, often without regard to future consequences. Now that technology has made global communication easy, cross-language, cross-cultural, and cross-platform issues are more important. So we have groups like unicode.org to guide development of text encodings in a future-friendly manner.


> saying that the encoding info location can vary even for plain text
> files? in fact, do plain text files have encoding info as standard, or
> is no encoding indication standard? i spose there could be variations
> between different platforms' plain text files maybe? please tell me
> that's not the case. i've got a horrible feeling that that is the case.
> there's no such thing as a or the standard plain text format is there?

Define "standard" first. It's not easy because it's subjective.

The baseline "standard" text encoding you are likely to encounter is 7-bit ASCII, which defines just enough characters to display plain English text like I'm typing into this mail. No typographic niceties like curly quotes or em-dashes, no accented characters for European languages, and forget about more complex writing systems like Chinese or Arabic.

To create "files intended for humans... containing natural language", 7-bit ASCII isn't enough, so over the last few decades everybody independently developed their own 8- and 16-bit extensions to ASCII. Look at the "Constants" section of the Foundation documentation for NSString, and then google each type of encoding to see what it defines. Everything past character 127 is defined arbitrarily, according to whatever needs the developers had at the time.
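
To make the "everything past 127 is arbitrary" point concrete, here is a minimal sketch in Python (not part of the original discussion, which is about Cocoa; the helper name decode_all and the choice of byte 0xE9 are just for illustration). It decodes one and the same byte under three legacy 8-bit encodings that ship with Python's standard codecs:

```python
# One raw byte above 127, interpreted under three legacy 8-bit encodings.
raw = bytes([0xE9])

def decode_all(data, encodings=("latin-1", "mac_roman", "cp437")):
    """Return {encoding name: decoded text} for the same raw bytes.

    The encoding list is an arbitrary sample chosen for this example.
    """
    return {name: data.decode(name) for name in encodings}

results = decode_all(raw)
for name, text in results.items():
    # Each encoding maps byte 0xE9 to a different Unicode character.
    print(f"{name:10} -> {text!r} (U+{ord(text):04X})")
```

Bytes below 128 decode identically in all three, which is why plain English text usually survives a wrong guess while accented text does not.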

Even something as simple as the line ending after each paragraph has three common "standards": classic Mac (CR), Win (CR LF), and Unix (LF).
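
As a rough illustration, the three conventions are the bytes CR (classic Mac), CR LF (Windows), and LF (Unix). The sketch below, with hypothetical helper names, counts each kind in a buffer and normalizes everything to LF. It assumes an ASCII-compatible 8-bit encoding; it would not be correct for UTF-16 data, where each character occupies two bytes.

```python
def detect_line_endings(data: bytes) -> dict:
    """Count CR LF pairs, lone CRs, and lone LFs in a byte buffer."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # CRs that are not part of a CR LF pair
    lf = data.count(b"\n") - crlf   # LFs that are not part of a CR LF pair
    return {"CRLF": crlf, "CR": cr, "LF": lf}

def normalize_to_lf(data: bytes) -> bytes:
    """Convert every line ending to a single LF.

    CR LF must be replaced first, otherwise each pair would become two LFs.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```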

In short, before Unicode, there was no standard, and if you are opening a "plain text" file that does not have a Unicode BOM (byte order mark) at the start, there is no way to tell programmatically (for sure) what the encoding is. You need a human to look at it and see if it's corrupt.
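
Checking for a BOM is the one part that can be automated reliably. A minimal sketch in Python, using the BOM constants from the standard codecs module (the function names and the latin-1 fallback are assumptions made for this example, not a recommendation):

```python
import codecs

# UTF-32 LE must be tested before UTF-16 LE: its BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom(data: bytes, fallback: str = "latin-1") -> str:
    """Decode using the BOM if there is one.

    Without a BOM the fallback is only a guess (an assumption of this
    sketch); the result may be wrong and needs a human to check it.
    """
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or fallback)
```

A return value of (None, 0) means exactly what the paragraph above says: the file carries no marker, so any encoding chosen from that point on is a guess.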

Perhaps, in time, all text encoding will be Unicode. And perhaps we will replace gasoline with solar power eventually, too. In the meantime, you have to deal with encoding conversions.


> how can i look at the raw contents of text files? when you open a text
> file in say bbedit, you just get the text - is there a unix command
> line tool that enables you to see the raw full contents of a file?

To see the true raw content, read the man page for hexdump. Any decent text editor like BBEdit is doing at least some minimal interpretation of linefeeds, and likely sniffing the encoding too.
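
If you would rather inspect the bytes from code, the following is a rough Python sketch of a hex dump. It only imitates the general offset / hex / printable-text layout of such tools; the hex_dump name and the exact column format are choices made for this example, not the output format of the hexdump command itself.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as lines of: offset, hex values, printable ASCII."""
    pad = width * 3 - 1  # width of a full hex column ("xx " per byte, minus one)
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable 7-bit ASCII as-is, everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  |{text_part}|")
    return "\n".join(lines)

# Read the file in binary mode so nothing interprets line endings or encoding:
# with open("some_file.txt", "rb") as f:   # hypothetical file name
#     print(hex_dump(f.read()))
```

Line endings show up as 0d and 0a, and a BOM, if present, appears as the first few bytes.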