Mailing Lists: Apple Mailing Lists


Re: dnd of filenames with locale specific characters



Ed Goulet wrote:

>Do you concur regarding the use of UTF-8?

No.  It makes no sense to me at all.  I don't even see where or why UTF-8
comes into the question.

A Java String is a sequence of 16-bit chars, not 8-bit bytes.  Each 16-bit
char is a UTF-16 code unit.  C may have 8-bit chars, but Java doesn't.
Java isn't C.

Even assuming fArray[i] is a String type, this code fragment makes no sense:
  byte[] theBytes = fArray[i].getBytes();
  String inUTF8 = new String(theBytes, "UTF8");

The variable named 'inUTF8' is not in fact encoded in UTF8.  Something very
different has happened, which I will examine in excruciating step-by-step
detail below.


First:

Some Latin-alphabet characters with accents, such as o-acute, have TWO
possible representations in Unicode, and so in UTF-16 (or in UTF-8, for
that matter).  One is the "composed accent" form.  For o-acute, that's the
single code position U+00F3.  The second is the "decomposed accent" form,
which is the unadorned letter code (U+006F) followed by the combining acute
accent (U+0301).  Combining accents apply to the preceding base character,
and they "stack", so you can have multiple ones applied to a single base
character.
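
To make the two forms concrete, here is a minimal sketch that builds both
spellings of o-acute as Java Strings and lists their chars.  The class and
method names (AccentForms, describe) are made up for illustration and are
not part of any API discussed here.

```java
final class AccentForms {
    // Composed o-acute: a single char, U+00F3.
    static final String COMPOSED = "\u00F3";
    // Decomposed o-acute: plain 'o' (U+006F) followed by combining acute (U+0301).
    static final String DECOMPOSED = "o\u0301";

    // Formats each char of s as a U+XXXX value, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(COMPOSED.length() + " char(s): " + describe(COMPOSED));
        System.out.println(DECOMPOSED.length() + " char(s): " + describe(DECOMPOSED));
        // String.equals() compares char by char, so the two forms differ.
        System.out.println("equals: " + COMPOSED.equals(DECOMPOSED));
    }
}
```

Both Strings display as the same letter, but they have different lengths and
String.equals() treats them as different.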

Now, if you use Finder to drop files or folders on a Java app, the
handleOpenFile() method receives an ApplicationEvent with a filename.  When
you call getFilename(), you receive the String ALREADY ENCODED in UTF-16.
You DO NOT receive UTF-8.

The String returned by getFilename() ALSO uses the decomposed accent form,
NOT the composed accent form.  So if you assume that each char represents
an ENTIRE character, you will be mistaken.  Some characters are represented
by an unadorned base char followed by a combining accent.

The reason for this is that HFS+ stores accented chars in decomposed form:
  <http://developer.apple.com/technotes/tn/tn1150table.html> -- table

However, if you use these Strings as filenames, to make File objects, they
will work.  You can call File.exists() or new FileInputStream(File) and it
will all work.  At least it does in all the versions of Java 1.3, 1.4, or
1.5 I've tested this on.

If you have evidence of decomposed accents not working in filenames, please
post the exact versions of Java and Mac OS X the problem appears in.


Second:

Say you do this to the String from ApplicationEvent.getFilename():
  String filename = theEvent.getFilename();
  byte[] theBytes = filename.getBytes();
  String inUTF8 = new String(theBytes, "UTF8");

Let's look at each line in detail.

1: String filename = theEvent.getFilename();

This clearly gets a String from the event.  What is the String's encoding?
It's clearly UTF-16, because a Java char is 16-bits, and Strings are a
sequence of char elements.

Does the String use composed or decomposed accents?  It uses decomposed
accents, because that's how HFS+ stores accented characters on disk.

So if the filename had an o-acute in it, that would be TWO adjacent chars:
'o' followed by combining-acute accent (U+0301).


2: byte[] theBytes = filename.getBytes();

This clearly converts the UTF-16 in the String to an encoded series of
bytes.  What encoding do the bytes represent?  That depends on the value of
the "file.encoding" property, which for a normal US English system will be
MacRoman.

So you now have an array of MacRoman bytes.  And if "file.encoding" is
something other than "MacRoman", then you have an array of bytes in that
encoding.
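
A quick way to see which encoding the no-argument getBytes() is using on a
given machine is sketched below.  It assumes Java 6 or later, because
String.getBytes(Charset) does not exist in 1.3 through 1.5
(Charset.defaultCharset() itself arrived in 1.5).  The class name
DefaultEncodingCheck is hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class DefaultEncodingCheck {
    // True if the no-argument getBytes() yields the same bytes as passing
    // the platform default charset explicitly.
    static boolean usesDefaultCharset(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset().name());
        System.out.println("same bytes      = " + usesDefaultCharset("o\u0301"));
    }
}
```

The printed names will differ from system to system, which is exactly the
problem with relying on the default.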

Let's assume "file.encoding" is "MacRoman".  Does MacRoman have a
code-point for combining-acute accent?  No, it doesn't.  Then what will
happen?  The converter may be smart, and attempt to combine accents and
then map to an o-acute in MacRoman.  But does it?  Try it and tell me.

If it does combine accents with prior letters, then does MacRoman have a
code-point for composed o-acute?  Yes, it does.  So if the conversion to
MacRoman is smart, we've also gotten a nice side-effect of combining
accents into their composed form.  But if the conversion is dumb, then the
acute-accent will be mistranslated or mangled.
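
The "try it" experiment can be sketched as follows.  Assumptions: Java 6 or
later for getBytes(Charset); the charset name "x-MacRoman" is an optional
charset that not every JRE ships, so the sketch falls back to ISO-8859-1,
which likewise has a composed o-acute but no combining acute accent.  The
class name DecomposedToBytes is hypothetical.  As far as I know the JDK
encoders work char by char and substitute a replacement byte for anything
unmappable, rather than composing accents first, but run it on the Java
version you care about rather than trusting that.

```java
import java.nio.charset.Charset;

final class DecomposedToBytes {
    // Encodes s in the named charset and returns the bytes as hex pairs.
    static String hexBytes(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "o\u0301";  // 'o' + combining acute
        String composed = "\u00F3";     // single-char o-acute
        // Fall back to ISO-8859-1 as a stand-in when MacRoman is not installed.
        String name = Charset.isSupported("x-MacRoman") ? "x-MacRoman" : "ISO-8859-1";
        System.out.println(name + " decomposed: " + hexBytes(decomposed, name));
        System.out.println(name + " composed:   " + hexBytes(composed, name));
    }
}
```

If the decomposed line shows the byte for 'o' followed by a replacement byte
such as 3F ('?'), the conversion is the "dumb" kind and the accent has been
lost.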


3: String inUTF8 = new String(theBytes, "UTF8");

This is the one I really don't understand.  As noted above, the byte array
is most likely MacRoman, or whatever the "file.encoding" defaults to.  So
what this line is doing is interpreting a series of MacRoman bytes as
UTF8-encoded bytes, and translating that to UTF-16 chars.

Huh?  The bytes aren't even *IN* the UTF8 encoding!  So why are you asking
for them to be converted from UTF8?

Actually, this code WOULD work in one case: if "file.encoding" happened to
be "UTF8".

But if "file.encoding" was "UTF8", then all you've done is translate from
16-bit chars to UTF-8 and back again, which should be a no-op.  A
time-consuming one, but still a no-op.

Also note that even if "file.encoding" were UTF8, this still wouldn't
convert decomposed accents into composed accents, AFAIK.  That's a
different conversion entirely, and I don't think the UTF8 encoder or
decoder will do it.

So at worst this line is mangling the data beyond usability, and at best
it's a time-consuming no-op.  And the outcome depends on the value of the
"file.encoding" property, whose default is language and locale dependent.
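
Both outcomes can be reproduced without touching "file.encoding" by naming
the encoding explicitly.  This is a sketch under assumptions: Java 6 or
later (for the Charset overloads), ISO-8859-1 standing in for MacRoman as
"some 8-bit encoding that is not UTF-8", and a hypothetical class name
MisdecodedRoundTrip.

```java
import java.nio.charset.Charset;

final class MisdecodedRoundTrip {
    // Mimics the fragment: encode with one charset, then decode the bytes
    // as if they were UTF-8.
    static String roundTrip(String s, String encodeWith) {
        byte[] theBytes = s.getBytes(Charset.forName(encodeWith));
        return new String(theBytes, Charset.forName("UTF-8"));
    }

    public static void main(String[] args) {
        String composed = "\u00F3.txt";     // composed o-acute
        String decomposed = "o\u0301.txt";  // decomposed o-acute

        // Mismatched encodings: does the name survive the round trip?
        System.out.println("8-bit then UTF-8, composed intact:   "
                + roundTrip(composed, "ISO-8859-1").equals(composed));
        System.out.println("8-bit then UTF-8, decomposed intact: "
                + roundTrip(decomposed, "ISO-8859-1").equals(decomposed));

        // Matching encodings: the String comes back unchanged, accents
        // still decomposed.
        String same = roundTrip(decomposed, "UTF-8");
        System.out.println("UTF-8 both ways, unchanged: " + same.equals(decomposed)
                + ", length " + same.length());
    }
}
```

The mismatched cases come back damaged, and the matching case comes back
exactly as it went in, with no accent composition along the way.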


If you have code like that 3-line fragment for processing file-drop events,
I have no idea why it's working.  Everything I understand, and everything
I've ever seen or tested, tells me that such code would fail badly when
given files that have accented letters in their names.

In fact, just yesterday I was testing a file-drop Java application I've
written on a range of OS and Java versions, and I have test-files with
accents, bullets, and ligatures in them.  The app worked perfectly, even
though all I do is this:
  File file = new File( event.getFilename() );

Nowhere do I convert to bytes, MacRoman, UTF8, or anything else.

Nor do I ever have to convert to or from decomposed or composed accent
forms.  I can confirm the Finder delivers decomposed accents in the
filenames, but nothing in my code cares in the least about this fact.  The
decomposed accents do produce some odd System.out.println() diagnostic
output on Console.app, but all the filenames work when opening or creating
files, which is all I really care about.

If I did care about the most common decomposed accents, I'd lift the
AccentComposer class from my MacBinary Toolkit:
  <http://www.amug.org/~glguerin/sw/#macbinary>

And if I cared even more, I'd probably use ICU4J.
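
For readers on newer Java versions there is a third option, sketched below.
It is NOT the AccentComposer class and NOT ICU4J; it uses
java.text.Normalizer, which was only added in Java 6 and so is unavailable
on the 1.3, 1.4, and 1.5 versions mentioned above.  The class name
ComposeAccents is hypothetical.

```java
import java.text.Normalizer;

final class ComposeAccents {
    // Converts decomposed accents to composed form where a composed
    // character exists (Unicode Normalization Form C).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "o\u0301.txt";  // as the Finder would deliver it
        String composed = compose(decomposed);
        System.out.println(decomposed.length() + " chars before, "
                + composed.length() + " chars after");
    }
}
```

I would only apply this to text being displayed or compared, and keep using
the original String from getFilename() to open the file.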


Other references, tools, normalizers, accent-composers, etc.:

Canonical Equivalence in Applications:
  <http://www.unicode.org/notes/tn5/>

UAX #15: Unicode Normalization:
  <http://www.unicode.org/reports/tr15/>

International Components for Unicode (see: ICU4J) -- 1.4+
  <http://icu.sourceforge.net/>
ICU's Normalization:
  <http://icu.sourceforge.net/userguide/normalization.html>

QA1235: Converting to Precomposed Unicode -- native calls, not Java
  <http://developer.apple.com/qa/qa2001/qa1235.html>

TN1150 HFS+/HFSX format -- find "Canonical Decomposition" on that page:
  <http://developer.apple.com/technotes/tn/tn1150.html>
  <http://developer.apple.com/technotes/tn/tn1150table.html> -- table


I hope that answered your question, though it was probably a somewhat
longer response than you expected.

  -- GG

p.s. The foregoing MAY OR MAY NOT have anything to do with dragging a
File-list from a Java app and dropping it on a Finder window.  That's a
completely different API from the EAWT ApplicationEvent API, and may
operate with different rules.  But it should now be clear where my
suspicion of composed vs. decomposed accents comes from.


 _______________________________________________
Java-dev mailing list      (email@hidden)

Copyright © 2007 Apple Inc. All rights reserved.