Mailing Lists: Apple Mailing Lists


Re: AWT FileDialog and Unicode



Daniel Bobbert <email@hidden> wrote:

>- the java.awt.FileDialog peer returns the UTF8 filename from HFS+
>- now here comes the bug: instead of decoding UTF8, MRJ on X pushes
>those raw UTF8 bytes through MacRoman.
>
>This leads to the following consequences:
>- if the filename contains only characters that are representable by
>MacRoman, then my workaround can be used to reverse the MacRoman
>encoding and then correctly decode the raw UTF8 bytes.
>- if the filename contains chars that are not representable by
>MacRoman, TEC obviously produces a numeric representation from which we
>don't know what it means (#15056 doesn't look like four characters to me
>though). Anyhow in this case there does not seem to be a way to reverse
>the conversion.

I thought exactly the same thing at first, but I don't think it's what's
happening in all cases. From what I could tell, the problems were not
coming JUST from UTF-8 bytes that were unrepresentable in MacRoman. It was
as if the original Unicode on disk first had to survive a round-trip
through MacRoman, was then turned into UTF-8 (per RFC 2044), and then
de-MacRomanized again into the mangled form that FileDialog returns.

The case I cited of the <1/4> glyph was a bad example. It's:
\u00BC -> 0xC2,0xBC

The 0xC2 is MacRomanizable, but the 0xBC isn't, so the conversion obviously
barfs. The distinctive way that UTF-8 works makes all chars \u00A0-\u00BF
have a 2nd byte with the same bit-pattern as the original char. There are
11 Unicode chars in that range that are not MacRomanizable, and their
placement in the relevant code-spaces makes a nice little mine-field to be
traversed.
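
The second-byte pattern is easy to confirm. A minimal sketch (the class and method names are made up for illustration) that prints the UTF-8 bytes of every char in that range:

```java
import java.nio.charset.StandardCharsets;

public class Utf8SecondByte {
    // Returns the UTF-8 encoding of one code point as "0xNN,0xNN".
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print the encoding of each code point from 0xA0 through 0xBF.
        for (int cp = 0xA0; cp <= 0xBF; cp++) {
            System.out.println(String.format("U+%04X -> %s", cp, utf8Hex(cp)));
        }
    }
}
```

Each line should show 0xC2 followed by a byte equal to the low 8 bits of the original char.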

To keep things interesting, however, the mines are sometimes disabled. Take:
\u00FC -> 0xC3,0xBC

You'd think it will fail, based on the above analysis. But it works, and
is successfully returned by FileDialog, which I then re-MacRomanize and
de-UTF8 using your work-around, resulting in the u-dieresis you
specifically cited. <bullwinkle>Nothing up my sleeve.</bullwinkle>
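
For anyone following along, the work-around amounts to something like the sketch below. This is my reading of it, not the code as originally posted. The class and method names are made up, the charset name "x-MacRoman" is an assumption about what the JVM provides, and mangle() is only one simple model of the suspected bug, not what MRJ necessarily does:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FileDialogWorkaround {
    // Assumption: this JVM provides MacRoman under the name "x-MacRoman".
    static final Charset MAC_ROMAN = Charset.forName("x-MacRoman");

    // Hypothetical model of the bug: the UTF-8 bytes of the real name
    // are wrongly decoded as if they were MacRoman text.
    static String mangle(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, MAC_ROMAN);
    }

    // The work-around: re-encode the mangled string as MacRoman to get
    // the raw bytes back, then decode those bytes as UTF-8.
    static String unmangle(String mangled) {
        byte[] raw = mangled.getBytes(MAC_ROMAN);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = new String(Character.toChars(0xFC)) + ".txt";
        String fixed = unmangle(mangle(name));
        System.out.println(fixed.equals(name));
    }
}
```

In real use only unmangle() would be applied, to the string built from FileDialog.getDirectory() and FileDialog.getFile().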

Something very subtle is going on. Take these four Unicode chars and
their UTF-8 encodings:
\u00D0 -> 0xC3,0x90
\u00D5 -> 0xC3,0x95
\u00F0 -> 0xC3,0xB0
\u00F5 -> 0xC3,0xB5

The first 3 fail to be returned by FileDialog. The last one works (i.e. is
returned as MacRomanized UTF-8).

The byte 0xC3 is Unicode/8859-1's A-tilde, which has a MacRoman
representation, so it's fine. Bytes 0x90 and 0x95 are "ctrl" chars, so who
knows what they MacRomanize into. But before you follow that as a clue,
consider that \u00D1 (N-tilde) UTF-8's into 0xC3,0x91, which works fine
with your work-around.

The really interesting cases are:
\u00F0 -> 0xC3,0xB0
\u00F5 -> 0xC3,0xB5

The first should get MacRomanized into A-tilde, degree-symbol. It doesn't.
The second should get MacRomanized into A-tilde, micro-symbol (mu). It
does.

What's different? The original char \u00F0 is a partial-derivative-symbol
(or something like that) with a crossbar (what language is that anyway?),
which does not have a MacRoman representation. But \u00F5 is small-o with
tilde, which has a MacRoman representation. Puzzlingly, though, \u00D5 is
capital-O with tilde, which also has a MacRoman representation, yet it
fails as noted above.
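
One way to double-check which chars and bytes are MacRomanizable is to ask the charset encoder directly. A rough sketch, with made-up class and method names and again assuming the JVM provides MacRoman as "x-MacRoman". It tests each original char and each of its UTF-8 bytes reinterpreted as an 8859-1 char:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MacRomanCheck {
    // Assumption: this JVM provides MacRoman under the name "x-MacRoman".
    static final Charset MAC_ROMAN = Charset.forName("x-MacRoman");

    // True if the given code point has a MacRoman representation.
    static boolean macRomanizable(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return MAC_ROMAN.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        int[] samples = {0xD0, 0xD1, 0xD5, 0xF0, 0xF5, 0xFC};
        for (int cp : samples) {
            byte[] utf8 = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            StringBuilder line = new StringBuilder(
                    String.format("U+%04X char=%b", cp, macRomanizable(cp)));
            for (byte b : utf8) {
                int v = b & 0xFF; // UTF-8 byte reinterpreted as an 8859-1 char
                line.append(String.format(" 0x%02X=%b", v, macRomanizable(v)));
            }
            System.out.println(line);
        }
    }
}
```

This only reports what Java's own MacRoman table says. TEC on the native side may not agree with it in every case.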

So while I agree that MacRomanizing the UTF-8 bytes is MUCH of the problem,
I don't think it's ALL of the problem. Or I could have missed something in
my tests and analysis (with all the mangling and transcodings involved,
it's mind-numbing).

In any case, FileDialog is still hopelessly and unrecoverably broken. The
patient is just as dead, and one cause of death is just as fatal as any
other.

-- GG





Copyright © 2007 Apple Inc. All rights reserved.