Mailing Lists: Apple Mailing Lists


Re: Bug with Text Encoding?



Ben Spink wrote:

>Sorry for the delay. I'm on digest mode.

Please do not post all the tail-quoted digest material.

FYI, all posts, only slightly delayed, appear in the on-line archives:
  <http://lists.apple.com/archives/java-dev/2007/Apr/index.html>


>Raw bytes1: A\u0438\u0306C
>Raw bytes2: A\u0438?C

Those things labeled as "bytes" are Unicode chars, not bytes.  Labeling
them "bytes" does not make them bytes.

The visible chars include a combining accent (\u0306, COMBINING BREVE),
which has no representation in Windows-1251.
  <http://www.microsoft.com/globaldev/reference/sbcs/1251.mspx>
That's why the \u0306 gets translated into a question-mark.

Once \u0306 has been translated into a question-mark with s1.getBytes(enc),
there's no possible way the question-mark can be reversed to the original
combining accent again.  Information has been irretrievably lost.  There is
no way to make the conversion round-trippable as-is.
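
A minimal sketch of that loss, assuming the JVM ships the "windows-1251"
charset (standard JDK builds normally do). The class and method names
are made up for illustration:

```java
import java.nio.charset.Charset;

public class LossyEncodeDemo {
    // Assumption: the extended charset "windows-1251" is available in this JVM.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Encode to Windows-1251 bytes, then decode those bytes again.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(CP1251);   // unmappable chars become '?'
        return new String(bytes, CP1251);
    }

    public static void main(String[] args) {
        String decomposed = "A\u0438\u0306C"; // decomposed form, as returned on HFS+
        String back = roundTrip(decomposed);

        // The combining breve was replaced by '?', so the strings differ.
        System.out.println("round-trips: " + back.equals(decomposed));
        System.out.println("breve became '?': " + back.equals("A\u0438?C"));
    }
}
```

The first line reports false and the second true: once the bytes
exist, nothing in them says the question-mark used to be U+0306.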


The reason you get \u0438\u0306 from File.list() is that HFS+ decomposes a
\u0439 into the sequence \u0438\u0306.  Look on this page for 0438 and/or
0439:
  <http://developer.apple.com/technotes/tn/tn1150table.html>
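
The same decomposition can be reproduced in plain Java, assuming a Java
version that has java.text.Normalizer (Java 6 or later). Note that this
uses standard NFD, which is only an approximation of what HFS+ does (the
technote table is the authority for HFS+), but for \u0439 the result is
the same. The class and helper names are hypothetical:

```java
import java.text.Normalizer;

public class DecomposeDemo {
    // Standard canonical decomposition (NFD); an approximation of HFS+ behavior.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Render non-ASCII chars as backslash-u escapes so the form is visible.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "A\u0439C";
        System.out.println(escape(composed));
        System.out.println(escape(decompose(composed)));
    }
}
```

The second line printed shows 0438 followed by 0306 where the first
shows the single char 0439.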

If you want to convert \u0438\u0306 into Windows-1251, then the first thing
you must do is normalize it back into its canonical composed form (Unicode
calls this NFC), which yields \u0439.  There are some links to Unicode
normalizers here:
  <http://lists.apple.com/archives/java-dev/2007/Apr/msg00111.html>

Off the top of my head, I don't know if they work with the Cyrillic
alphabet, but it's worth a try.  You could probably find more info by
googling for java unicode normalizer.
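
For what it's worth, here is a sketch of the normalize-then-convert step
using java.text.Normalizer. That class is an assumption on my part: it
exists only in Java 6 and later, so on an older VM you would need one of
the third-party normalizers instead. Class and method names are invented
for the example:

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

public class ComposeThenEncode {
    // Assumption: the "windows-1251" charset is available in this JVM.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Compose first (NFC), then convert to Windows-1251 bytes.
    static byte[] encodeComposed(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.getBytes(CP1251);
    }

    public static void main(String[] args) {
        String fromHfs = "A\u0438\u0306C";           // decomposed, 4 chars
        byte[] bytes = encodeComposed(fromHfs);      // 3 bytes, no '?'
        String back = new String(bytes, CP1251);

        System.out.println("byte count: " + bytes.length);
        System.out.println("decodes to composed form: " + back.equals("A\u0439C"));
    }
}
```

Note that decoding gives back the composed string, not the original
decomposed one. If the name must match what HFS+ returned, decompose
again after decoding.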


Since you didn't show the XP run, I'm going to guess that XP returns
"A\u0439C".  Since there is a representation in Windows-1251 for \u0439,
the conversion to bytes loses no information.  As a result, the original
Unicode data is round-trippable.

However, if you were to normalize to the decomposed form on XP, or start
with the literal String:
  String s1 = "A\u0438\u0306C";
then the conversion to Windows-1251 should do exactly the same thing on XP
as it does on Mac OS X: convert the combining accent to question-mark,
destroying round-trippability.


The only bug I see here is in assuming that the conversion to Windows-1251
can proceed without first normalizing to a composed form. File-systems need
not return one form or the other, nor even a consistently normalized form.
If you assume they do, then the bug lies in that assumption.

Arguably, Java's encoders could be smarter about normalizing before
converting, but they aren't.  You have to write code with what is, not with
what you might wish to be.

Personally, I think it very unlikely that Java's encoders will ever be
changed to do normalization.  Normalization to a particular Unicode form
(composed or decomposed) is something you will probably always need to do
before attempting to call String.getBytes( "SomeEncoding" ).
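
If you would rather get an exception than a silent question-mark, one
option is to go through CharsetEncoder directly instead of
String.getBytes(). The following is only a sketch: the class and method
names are hypothetical, and it assumes java.text.Normalizer (Java 6 or
later) is available.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.text.Normalizer;

public class StrictEncoder {
    // Normalize to NFC, then encode; throw instead of substituting '?'.
    static byte[] encodeStrict(String s, String charsetName)
            throws CharacterCodingException {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        ByteBuffer buf = Charset.forName(charsetName)
                .newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(nfc));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        try {
            byte[] ok = encodeStrict("A\u0438\u0306C", "windows-1251");
            System.out.println("encoded " + ok.length + " bytes");
            // A CJK char has no Windows-1251 representation in any form.
            encodeStrict("\u4e2d", "windows-1251");
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode: " + e);
        }
    }
}
```

With REPORT in place, any char that is still unrepresentable after
composing shows up as a CharacterCodingException at conversion time,
rather than as a corrupted name later.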

  -- GG

