Mailing Lists: Apple Mailing Lists


Re: Mac Character encoding



<email@hidden> wrote:

>Users enter object names or nouns into my java app's
>interface

As in "dog", "cat", "banana daiquiri"?

> as well as some data related to those name.

As in "train", "herd", "how to make"?


>I need to represent those names in HTML as well as
>individual files that contain data related to those
>names. Each name (object) has it's own HTML file
>associated to it.

It seems to me like you have a straightforward transformation problem.  You
need to take arbitrary text input in any language representable by Unicode,
and turn it into a name in a smaller alphabet, one that's safe and
unambiguous in URLs.

I can think of several ways to do that, depending on how perfect the
transformation has to be, how collision-free, etc.

Simplest example: strip all accents.

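While it's conceptually simple, it isn't trivial in practice, because there's
no single call in Java that does exactly that.  (java.text.Normalizer, added
in Java 6, gets you part of the way by splitting accented letters into a
base letter plus combining marks.)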

A simple strategy is to map precomposed accented Unicode chars to their
unadorned Latin-alphabet letters (one char in, one char out, but lossy:
e-acute and e-grave both become plain "e").  Combining-accent chars are
simply omitted.  Alphabets other than Latin A-Z get transformed into
something in Latin A-Z.
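
Here's a minimal sketch of the accent-stripping idea, assuming Java 6 or
later so that java.text.Normalizer is available.  The class and method
names (AccentStripper, stripAccents) are made up for illustration.  It only
covers letters that actually decompose into base letter plus combining
mark; chars like o-slash or thorn, and non-Latin alphabets, pass through
unchanged and would need their own mapping table.

```java
import java.text.Normalizer;

public class AccentStripper {
    // Decompose each char into base letter + combining marks (NFD),
    // then drop every char in Unicode category M (marks).
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // e-acute and e-grave both collapse to plain 'e' (lossy by design).
        System.out.println(stripAccents("caf\u00e9 cr\u00e8me"));
    }
}
```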

A slightly more complicated example is to substitute multi-char sequences
for accented letters.  For example, use the digits 0-9 to signify 10
different accents.  Say acute is 0 and grave is 1.  So e-acute is then "e0"
and e-grave is "e1".  By "e0" I mean the two-char sequence 'e' followed by
'0'.  I don't mean a hex byte 0xE0 or 0xE1.  Since there are more than 10
combining accents, you'll have to use a larger substitution alphabet, but you
get the idea.
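
A rough sketch of that substitution, again assuming java.text.Normalizer
(Java 6+).  The names AccentDigits and encode are hypothetical, and the
table holds only the two accents from the example above.  Any other
combining mark is silently dropped here, which a real version would have
to handle.  Note also that a name already containing digits ("e0" typed
literally) would collide with an encoded e-acute, so you'd need to escape
or forbid digits in the input.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class AccentDigits {
    // Hypothetical table: combining mark -> digit.  Only two entries here.
    private static final Map<Character, Character> MARK_TO_DIGIT =
            new HashMap<Character, Character>();
    static {
        MARK_TO_DIGIT.put('\u0301', '0'); // combining acute accent
        MARK_TO_DIGIT.put('\u0300', '1'); // combining grave accent
    }

    public static String encode(String input) {
        // NFD puts each accent right after the letter it modifies.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            Character digit = MARK_TO_DIGIT.get(c);
            if (digit != null) {
                out.append(digit.charValue());   // known accent -> its digit
            } else if (Character.getType(c) != Character.NON_SPACING_MARK) {
                out.append(c);                   // ordinary char, keep as-is
            }
            // Unmapped combining marks fall through and are dropped.
        }
        return out.toString();
    }
}
```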

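For the completely general case, "mapping" devolves into expanding each
Unicode char into a multi-letter sequence, as in small-thorn (\u00FE) is
turned into the sequence "u00fe".  Yes, it's a 5X expansion, but it's
collision-free as long as every char gets expanded, plain ASCII included
(otherwise a literal "u00fe" in the input collides with an encoded thorn).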
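
As a sketch of the general case, with the invented names HexNames, encode
and decode: every Java char (UTF-16 code unit) becomes 'u' plus four hex
digits, so the output is always exactly five times as long and can be
split back apart without ambiguity.

```java
public class HexNames {
    // Expand every char, ASCII included, into "u" + 4 lowercase hex digits.
    public static String encode(String input) {
        StringBuilder out = new StringBuilder(input.length() * 5);
        for (int i = 0; i < input.length(); i++) {
            out.append(String.format("u%04x", (int) input.charAt(i)));
        }
        return out.toString();
    }

    // Inverse of encode: read the string back in fixed 5-char groups.
    public static String decode(String encoded) {
        if (encoded.length() % 5 != 0) {
            throw new IllegalArgumentException("length is not a multiple of 5");
        }
        StringBuilder out = new StringBuilder(encoded.length() / 5);
        for (int i = 0; i < encoded.length(); i += 5) {
            if (encoded.charAt(i) != 'u') {
                throw new IllegalArgumentException("missing 'u' at index " + i);
            }
            String hex = encoded.substring(i + 1, i + 5);
            out.append((char) Integer.parseInt(hex, 16));
        }
        return out.toString();
    }
}
```

Because the groups are fixed-width, decode(encode(s)) gives back s for any
input, which is what makes the scheme invertible as well as collision-free.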

The pivotal idea is that you can turn any sequence of Unicode chars into
any other sequence of chars in a smaller alphabet, using whatever
transformation you need, meeting whatever goals you define.  But you'd
better define the goals very carefully, or you won't get what you need.

If you confine yourself to simple transformations, then you shift the
burden onto the browser or whatever is interpreting the HTML and URLs.  If
you expand the transformations (e.g. one char may map to a multi-char
series), then you put more work up front, but you simplify what the
browsers have to deal with.

It all depends on exactly what you need from the mappings: perfectly
invertible, perfectly collision-free, readability for humans, brevity for
common English words, case-insensitivity, whatever.


>I need to find a way to access those
>named HTML files as URLs as part of the output. So far
>it's working ok, however some of the browsers are
>failing on some of the URL filenames that are using
>the utf-8 charset.

It probably would fail, especially if you aren't doing anything special to
the multi-byte UTF-8 encoded chars.  Are you URL-escaping them?
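
In case it helps, here's one way the escaping could look, using
java.net.URLEncoder.  The class and method names (UrlNames,
escapePathSegment) are invented for the example.  URLEncoder is really
meant for form data, so it writes a space as '+'; the sketch swaps that for
"%20", which is what a URL path expects.  Treat it as an approximation of
proper path escaping rather than a complete one.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlNames {
    // Percent-escape a single file name for use as one URL path segment.
    public static String escapePathSegment(String name) {
        try {
            // Each non-ASCII char becomes its UTF-8 bytes as %XX escapes.
            // A literal '+' in the input is already escaped as %2B, so every
            // '+' left in the result stands for a space.
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }
}
```

The href in the HTML would then carry the escaped name (for example
"caf%C3%A9.html"), while the file on disk keeps whatever name your
transformation produced.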

What program is reading this HTML and having to resolve the URLs you've
written in there?

Different browsers may not have exactly the same URL-de-escaping algorithm.

  -- GG


Java-dev mailing list      (email@hidden)

Copyright © 2007 Apple Inc. All rights reserved.