Mailing Lists: Apple Mailing Lists


Re: Bug with Text Encoding?



Ben Spink wrote:

>   System.out.print("Raw bytes1: ");printBytes(s1.getBytes());

A Java String is a sequence of Java chars.
A Java char is a 16-bit UTF-16 code unit, not always a whole code point.
A Java char is NOT a byte.

When you call String.getBytes(), it converts the Unicode codes into bytes
using the default encoding.  On Mac OS X, the default encoding depends on
your chosen primary language (System Preferences, International pane,
Languages tab, list of languages).

You can see the default encoding name in the "file.encoding" system property.
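
A minimal sketch of checking this, assuming Java 6 or later for
String.getBytes(Charset).  The class and method names are made up for
illustration.  It prints the default encoding name, then shows that the same
one-char String produces different bytes under two explicitly named encodings.

```java
import java.nio.charset.Charset;

class DefaultEncodingDemo {
    /** Returns the default encoding name from the "file.encoding" system property. */
    static String defaultEncodingName() {
        return System.getProperty("file.encoding");
    }

    /** Encodes text with an explicitly named charset instead of the platform default. */
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        System.out.println("file.encoding = " + defaultEncodingName());
        // U+00E9 (e with acute accent): one String, two different byte sequences.
        System.out.println(encode("\u00E9", "UTF-8").length);       // 2
        System.out.println(encode("\u00E9", "ISO-8859-1").length);  // 1
    }
}
```

Naming the encoding explicitly makes the bytes the same on every platform,
which the no-argument getBytes() does not.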

On a typical Western-European language Mac OS X config, the default
encoding will be MacRoman.  Therefore, when you call printBytes(), you are
actually printing the MacRoman-encoded version of the Unicode code points.
This will NOT tell you what the original Unicode text of the String is.
All it does is tell you what the default-encoded bytes of the String are.
The Unicode text may well be unencodeable in the default encoding.  Or the
encoding may not be reversible.  Or other bad things.

You need to print the Unicode chars in the String, as 16-bit values.
Printing bytes is meaningless.  A Java char is not a byte.


>  String enc = "Windows-1251";
> ...
>  String s2 = new String(s1.getBytes(enc),enc);

If the Windows-1251 encoding can represent the Unicode chars in s1, this
should be round-trippable.  Your output log from Mac OS X demonstrates that
it is NOT round-trippable.

To find out why, you need to look at the actual 16-bit chars in the actual
Strings, not at the byte-encoded conversions.
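
A rough sketch of that comparison follows.  It ASSUMES the character in
question is U+0419 (CYRILLIC CAPITAL LETTER SHORT I), which is what byte -55
(0xC9) means in Windows-1251, and that the decomposed form is U+0418 followed
by U+0306 (COMBINING BREVE).  Neither has been confirmed from the actual
filename.  The class and method names are hypothetical, and
getBytes(Charset) needs Java 6 or later.

```java
import java.nio.charset.Charset;

class RoundTripDemo {
    /** Encodes text to bytes in the named charset, then decodes those bytes again. */
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return new String(text.getBytes(cs), cs);
    }

    /** Returns each 16-bit char as four hex digits, separated by spaces. */
    static String hexChars(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) out.append(' ');
            out.append(String.format("%04X", (int) text.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String composed = "A\u0419C";          // assumed precomposed form
        String decomposed = "A\u0418\u0306C";  // assumed decomposed form
        String[] samples = { composed, decomposed };
        for (String s : samples) {
            String back = roundTrip(s, "windows-1251");
            System.out.println(hexChars(s) + " -> " + hexChars(back)
                    + "  equal: " + s.equals(back));
        }
        // 0041 0419 0043 -> 0041 0419 0043  equal: true
        // 0041 0418 0306 0043 -> 0041 0418 003F 0043  equal: false
    }
}
```

Under these assumptions the combining breve has no Windows-1251 byte, so it
is encoded as '?' (0x3F) and the round trip fails only for the decomposed
String.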


>Here are the results from my XP machine with Java 1.4.2 and 1.5:
>Output:
>Raw bytes1: 65 63 67
>Windows-1251 bytes1: 65 -55 67
>Raw bytes2: 65 63 67
>Windows-1251 bytes2: 65 -55 67
>Success: true

This shows a few things:
  1. list() returns a different representation on Windows than on Mac OS X.
  2. The XP default encoding differs from Mac OS X's.
  3. The XP default encoding is not Windows-1251.

What it does NOT show is what the representation from list() actually is
for XP.  You haven't shown what the list() representation for Mac OS X is,
either.

I strongly suspect that when you look at what the actual Unicode
representation is, on each platform, it will explain a whole lot about why
your code is getting the results it is.  It may explain everything, but
since I don't know what the Unicode code-points are for your "A-backwards N
with tick on it-C" I can't really test this.

You need to be very specific about what your Unicode text is.  If you print
the hex values of the Java chars, you can represent them in literal source
text as "\uXXXX" sequences.  A Properties file also accepts \uXXXX
sequences in property values.
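
A small sketch of the Properties route, using a hypothetical key "filename"
and the assumed character U+0419.  Properties.load(InputStream) reads
ISO-8859-1 bytes and expands \uXXXX escapes, so the text stays pure ASCII
until it is loaded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Properties;

class PropertiesEscapeDemo {
    /** Loads properties-file text from memory and returns the value for key. */
    static String loadValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            // Properties.load(InputStream) expects ISO-8859-1 encoded input.
            props.load(new ByteArrayInputStream(propertiesText.getBytes("ISO-8859-1")));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // The doubled backslash keeps the escape literal in this Java source,
        // so the Properties parser is what expands it.
        String fileText = "filename=A\\u0419C\n";
        String value = loadValue(fileText, "filename");
        System.out.println(value.length());                // 3
        System.out.println((int) value.charAt(1) == 0x0419);  // true
    }
}
```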

Otherwise you have to tell us exactly which keyboard layout and keystroke
sequence you used to produce the filename.


>How can two strings with the exact same reported byte sequence not "equal"
>each other?

Equal bytes do not imply equal Unicode chars.

String.getBytes() returns an ENCODED representation, and not all encodings
have one-to-one mappings with Unicode.  In fact, most don't.  UTF-8 does,
but Unicode itself also has canonical composed and decomposed forms, which
alter the UTF-8 bytes accordingly.  See the references I cited earlier.
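
To illustrate with a sketch (hypothetical class name, assumed character
U+0419, Java 6 or later): ISO-8859-1 cannot represent U+0419, so getBytes()
substitutes '?', which is byte 63.  Two different Strings then yield
identical bytes.  This may be why the "Raw bytes" lines above show 65 63 67,
though that depends on the XP default encoding, which the log does not name.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class EqualBytesDemo {
    /** Returns true if both Strings encode to the same bytes in the named charset. */
    static boolean sameBytes(String a, String b, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return Arrays.equals(a.getBytes(cs), b.getBytes(cs));
    }

    public static void main(String[] args) {
        String original = "A\u0419C";  // assumed original text
        String lossy = "A?C";          // what a lossy encoding leaves behind

        // Unmappable chars become '?', so the encoded bytes match: 65 63 67.
        System.out.println(sameBytes(original, lossy, "ISO-8859-1"));  // true
        // The 16-bit chars still differ, so the Strings are not equal.
        System.out.println(original.equals(lossy));                    // false
        // An encoding that can represent every char keeps them distinct.
        System.out.println(sameBytes(original, lossy, "UTF-8"));       // false
    }
}
```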


>Can anyone successfully convert a string with that char using windows-1251
>to bytes, and reverse that back to the same string on a OS X?

I suspect the Mac is giving you a decomposed form, but you need to look at
the 16-bit chars, not the bytes.

To normalize from decomposed to composed, see the references I cited
earlier; there are several normalizers.
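
One possible normalizer, which is not necessarily among those cited, is
java.text.Normalizer.  It exists only in Java 6 and later, so it is not
available on 1.4.2 or 1.5.  The sketch below uses a hypothetical class name
and the assumed decomposed sequence U+0418 U+0306.

```java
import java.text.Normalizer;

class NormalizeDemo {
    /** Converts text to canonical composed form (NFC). */
    static String toComposed(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "A\u0418\u0306C";  // base letter plus combining breve
        String composed = toComposed(decomposed);

        System.out.println(decomposed.length());         // 4
        System.out.println(composed.length());           // 3
        System.out.println(composed.equals("A\u0419C"));  // true
    }
}
```

Normalizing to NFC before calling getBytes() gives a legacy encoding such as
Windows-1251 a precomposed char it can represent, where one exists.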


>By scrambled, I mean the unicode value of the string is "A-backwards N
>followed by a question mark-C". That's not the same unicode string I
>started with.

And it won't be if Windows-1251 can't represent combining accents.  The
canonical decomposed form uses them, and I suspect that form is what Mac
OS X is giving you from list().

That's why you need to figure out what the original 16-bit chars in the
String are.

  -- GG


  /**
  ** Return a String with all chars made visible.
  ** Chars in the range 0x00-0x1F are represented as Unicode escapes.
  ** Chars in the range 0x20-0x7E are represented as-is,
  ** EXCEPT that a '\' is represented as two backslashes (escaped form).
  ** Chars in the range 0x7F and up are represented as Unicode escapes.
  */
  public static String
  visible( String text )
  {
    StringBuffer build = new StringBuffer( text.length() );

    for ( int i = 0;  i < text.length();  ++i )
    {
      int each = text.charAt( i );
      if ( each < 0x20  ||  each >= 0x7F )
        build.append( "\\u" ).append( hex( each, 4 ) );
      else
      {
        // Represented as-is, except...
        build.append( (char) each );

        // ...backslash (0x5C) is doubled in output text.
        if ( each == 0x5C )
          build.append( (char) each );
      }
    }

    return ( build.toString() );
  }

  /** Return a zero-filled upper-case hex sequence. */
  public static char[]
  hex( int value, int digits )
  {
    char[] chars = new char[ digits ];
    for ( int i = chars.length;  --i >= 0;  value >>>= 4 )
    {  chars[ i ] = "0123456789ABCDEF".charAt( value & 0x0F );  }

    return ( chars );
  }


Java-dev mailing list      (email@hidden)

Copyright © 2007 Apple Inc. All rights reserved.