Re: Java-dev Digest, Vol 4, Issue 150



Here are additional details and a better example Java class for this.

import java.io.*;
public class Enc {
	public static void main(String[] args) throws Exception {
		String enc = "Windows-1251";
		// First filename in the folder, exactly as File.list() returns it.
		String s1 = new File("/oneFolder/").list()[0];
		// "Raw" bytes use the platform default encoding, which may differ between machines.
		System.out.print("Raw bytes1: ");printBytes(s1.getBytes());
		System.out.print(enc+" bytes1: ");printBytes(s1.getBytes(enc));
		// Round trip: encode to Windows-1251, then decode back into a String.
		String s2 = new String(s1.getBytes(enc),enc);
		System.out.print("Raw bytes2: ");printBytes(s2.getBytes());
		System.out.print(enc+" bytes2: ");printBytes(s2.getBytes(enc));
		System.out.println("Success: "+s1.equals(s2));
	}
	// Prints each byte as a signed decimal value (-128 to 127).
	public static void printBytes(byte b[]) {
		for (int x=0; x<b.length; x++) System.out.print((int)b[x]+" ");
		System.out.println("");
	}
}


Output:
Raw bytes1: 65 63 63 67
Windows-1251 bytes1: 65 -24 63 67
Raw bytes2: 65 63 63 67
Windows-1251 bytes2: 65 -24 63 67
Success: false

***Notice the success was false!

I get identical results with Java 1.4.2 and 1.5:
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_12-269)
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)

The item it's getting from the list is still the same "A-backwards N with tick on it-C".

Here are the results from my XP machine with Java 1.4.2 and 1.5:
Output:
Raw bytes1: 65 63 67
Windows-1251 bytes1: 65 -55 67
Raw bytes2: 65 63 67
Windows-1251 bytes2: 65 -55 67
Success: true

***Notice the success was true!

Same code run on both machines.

This is OS 10.4.9, Intel Mini.

How can two strings with the exact same reported byte sequence not "equal" each other?

Can anyone successfully convert a string with that char to bytes using Windows-1251, and then convert those bytes back to the same string, on OS X?

By scrambled, I mean the Unicode value of the string is "A-backwards N followed by a question mark-C". That's not the same Unicode string I started with.
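
A minimal sketch of how two different strings can report identical bytes. It assumes, purely for illustration, that the filename is A, CYRILLIC SMALL LETTER I (U+0438), COMBINING BREVE (U+0306), C; the actual chars returned by File.list() were not posted. The class name LossyEncodingDemo and the helper roundTrip are hypothetical. String.getBytes(String) substitutes the charset's replacement byte, here "?", for any char it cannot map, so the encoding step loses information and the decoded string differs from the original even though both encode to the same bytes.

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class LossyEncodingDemo {
    // Encodes s with the named charset, then decodes the bytes with the same charset.
    static String roundTrip(String s, String charsetName) throws UnsupportedEncodingException {
        return new String(s.getBytes(charsetName), charsetName);
    }

    public static void main(String[] args) throws Exception {
        String enc = "windows-1251";
        // Assumed stand-in for the filename: 'A', U+0438, U+0306 (combining breve), 'C'.
        String s1 = "A\u0438\u0306C";
        String s2 = roundTrip(s1, enc);

        // The combining breve has no windows-1251 code, so it is encoded as '?' (63).
        System.out.println(enc + " bytes1: " + Arrays.toString(s1.getBytes(enc)));
        // s2 contains a literal '?', which also encodes as 63.
        System.out.println(enc + " bytes2: " + Arrays.toString(s2.getBytes(enc)));
        // The strings differ: U+0306 in s1 versus '?' in s2.
        System.out.println("Success: " + s1.equals(s2));
    }
}
```

Under that assumption both byte arrays are 65 -24 63 67, the same values as in the OS X output above, while equals() returns false.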

Thanks,
Ben





------------------------------

Message: 2
Date: Tue, 17 Apr 2007 22:39:50 -0700
From: Greg Guerin <email@hidden>
Subject: Re: Bug with Text Encoding?
To: email@hidden
Message-ID: <l03130300c24b570218d1@[192.168.11.2]>
Content-Type: text/plain; charset="us-ascii"

Ben Spink wrote:

public static void main(String[] args) throws Exception{
String s = new File("/testFolder/").list()[0];
if (s.startsWith(".DS_")) s = new File("/testFolder/").list()[1];
ByteArrayOutputStream bao = new ByteArrayOutputStream(100);
bao.write(s.getBytes("Windows-1251"));
byte b[] = bao.toByteArray();
String s2 = new String(b,"Windows-1251");
if (s.equals(s2)) System.out.println("Success");
else System.out.println("Failure");
}
...
However, after converting it to the encoding "Windows-1251" which
definitely supports this character, then converting it back, I don't get
the same string. It's been scrambled slightly.

Exactly what does "scrambled slightly" mean?

The data in bao is encoded from some chars that have precise binary
patterns in them.  What are those binary patterns?

The resulting bytes in bao should be precise binary patterns. What are
those binary patterns?


Reconverting the bytes from bao back into a String will result in precise
binary patterns. What are those binary patterns?


In short, stop being high-level about it and glossing over details, and go
find out exactly what the binary patterns are at each stage, and confirm
they represent what you think they do.
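
One way to look at the binary patterns at each stage is to print every char of the String as a hex code unit, and every encoded byte as unsigned hex, rather than relying on how the text renders. The following is only a sketch: the class name CharDump and its helpers are hypothetical, and the folder to list is taken from the first command-line argument.

```java
import java.io.File;

public class CharDump {
    // Formats each UTF-16 code unit of s as U+XXXX, separated by spaces.
    static String dumpChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Formats each byte as two unsigned hex digits, separated by spaces.
    static String dumpBytes(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < b.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", b[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String enc = "windows-1251";
        String[] names = new File(args.length > 0 ? args[0] : ".").list();
        if (names == null) return; // not a directory, or an I/O error occurred
        for (String name : names) {
            byte[] encoded = name.getBytes(enc);
            String decoded = new String(encoded, enc);
            System.out.println("chars from File.list(): " + dumpChars(name));
            System.out.println(enc + " bytes:     " + dumpBytes(encoded));
            System.out.println("chars after decoding:   " + dumpChars(decoded));
        }
    }
}
```

Running this against the same folder on both machines would show whether the chars coming out of File.list() already differ before any encoding takes place.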


I would personally be very suspicious of what File.list() returns for
filenames, because it's well-known that the HFS+ storage format is Apple's
decomposed Unicode. Search the list archives for details.


The string you see may very well show up exactly as you think it should
when you display it in the debugger, but all that means is that the
rendering you see is what you expect. It may have a very different binary
pattern from what you expect, and that specific binary pattern may not
directly translate into Windows-1251 without an intervening step.


For example, the common Windows encoding CP1252 has codes for many
precomposed accented Roman letters. However, if you were to File.list() a
filename with an acute-E, it would come back from HFS+ as capital-E
followed by a combining acute accent. That sequence is not directly
representable in CP1252 as-is. The combining accent must first be composed
with the E into an acute-E, and then that single code can be translated to
CP1252. However, when rendered in decomposed Unicode form, the sequence
capital-E combining-acute renders exactly the same as the composed Unicode
form acute-E. In other words, you see one thing and assume it's what you
expected, but there are at least two very different binary patterns that
render to the exact same visible pixels. Welcome to Unicode.
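
The intervening composition step described above can be sketched with java.text.Normalizer. Note the assumption: that class exists only in Java 6 and later, so it is not available in the 1.4.2 and 1.5 runtimes discussed in this thread. The class name ComposeBeforeEncode and the helper encodeComposed are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.text.Normalizer;
import java.util.Arrays;

public class ComposeBeforeEncode {
    // Composes the string to NFC first, then encodes it with the named charset.
    static byte[] encodeComposed(String s, String charsetName) throws UnsupportedEncodingException {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.getBytes(charsetName);
    }

    public static void main(String[] args) throws Exception {
        // Decomposed form: capital E followed by COMBINING ACUTE ACCENT (U+0301).
        String decomposed = "E\u0301";

        // Encoded as-is, the combining accent cannot be mapped and becomes '?' (63).
        byte[] direct = decomposed.getBytes("windows-1252");
        System.out.println("direct:   " + Arrays.toString(direct));

        // Composed first, the pair becomes U+00C9, which has a single CP1252 code.
        byte[] composed = encodeComposed(decomposed, "windows-1252");
        System.out.println("composed: " + Arrays.toString(composed));
    }
}
```

The same approach would apply to the Cyrillic case if the filename really is a base letter plus a combining mark, but that remains a guess until the actual chars are inspected.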


Search recent list postings for keywords: Unicode, accents, composed,
decomposed.  For example, see:
  <http://lists.apple.com/archives/java-dev/2007/Apr/msg00111.html>


When I run this "identical" code on my XP machine with Java 1.5, it works
correctly. (I have to alter the path...but that's it.)

Exactly what binary patterns is XP returning from File.list()? How do they
compare to the binary patterns for the supposedly equivalent filename on
Mac OS X's File.list()?


And which Java version is malfunctioning on Mac OS X? Given no other info,
I would assume 1.5, except that a recent post of yours complains about a
bug with 1.5 and implies that you've reverted to 1.4.2, so I can't tell
what your reference is or what context you found this problem under. For
example, it could be that you discovered this Unicode problem when testing
under 1.4, but that's just a guess. Unless you tell us what the context
is, all anyone can do is guess.


In particular, 1.5's encoders may well be smarter than 1.4's, because 1.5
knows more about combining accents being "attached" to predecessor chars.
1.4 is pretty stupid about certain crucial aspects of Unicode, and that
stupidity only got alleviated in 1.5.



It does this with other characters too...this is just an example.
Characters that should be supported are not supported and get scrambled on
the encode, decode operation.

You'll have to be specific about what fails, and exactly what "scrambled"
means.


  -- GG



