I get identical results with Java 1.4.2 and 1.5:
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_12-269)
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164)
The item it's getting from the list is still the same "A-backwards N
with tick on it-C".
Here are the results from my XP machine with Java 1.4.2 and 1.5:
Output:
Raw bytes1: 65 63 67
Windows-1251 bytes1: 65 -55 67
Raw bytes2: 65 63 67
Windows-1251 bytes2: 65 -55 67
Success: true
***Notice the success was true!
Same code run on both machines.
This is OS 10.4.9, Intel Mini.
How can two strings with the exact same reported byte sequence not
"equal" each other?
Can anyone successfully convert a string with that char using
windows-1251 to bytes, and reverse that back to the same string on
OS X?
By scrambled, I mean the Unicode value of the string is "A-backwards
N followed by a question mark-C". That's not the same Unicode string
I started with.
Thanks,
Ben
------------------------------
Message: 2
Date: Tue, 17 Apr 2007 22:39:50 -0700
From: Greg Guerin <email@hidden>
Subject: Re: Bug with Text Encoding?
To: email@hidden
Message-ID: <l03130300c24b570218d1@[192.168.11.2]>
Content-Type: text/plain; charset="us-ascii"
Ben Spink wrote:
> public static void main(String[] args) throws Exception {
>     String s = new File("/testFolder/").list()[0];
>     if (s.startsWith(".DS_")) s = new File("/testFolder/").list()[1];
>     ByteArrayOutputStream bao = new ByteArrayOutputStream(100);
>     bao.write(s.getBytes("Windows-1251"));
>     byte b[] = bao.toByteArray();
>     String s2 = new String(b, "Windows-1251");
>     if (s.equals(s2)) System.out.println("Success");
>     else System.out.println("Failure");
> }
...
> However, after converting it to the encoding "Windows-1251", which
> definitely supports this character, then converting it back, I don't get
> the same string. It's been scrambled slightly.

Exactly what does "scrambled slightly" mean?

The data in bao is encoded from some chars that have precise binary
patterns in them. What are those binary patterns?

The resulting bytes in bao should be precise binary patterns. What are
those binary patterns?

Reconverting the bytes from bao back into a String will result in precise
binary patterns. What are those binary patterns?

In short, stop being high-level about it and glossing over details, and go
find out exactly what the binary patterns are at each stage, and confirm
they represent what you think they do.
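
One way to capture those patterns at each stage is sketched below. It is only an illustration: the class and method names (DumpPatterns, charsToHex, bytesToHex) are made up for this example, and the default string "A\u0419C" is an assumed stand-in for the filename described as "A-backwards N with tick on it-C". Pass the real filename as an argument to dump its actual contents.

```java
import java.io.UnsupportedEncodingException;

public class DumpPatterns {
    // Hex value of each UTF-16 code unit in the string, separated by spaces.
    public static String charsToHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            String h = Integer.toHexString(s.charAt(i)).toUpperCase();
            while (h.length() < 4) h = "0" + h;
            sb.append(h);
        }
        return sb.toString();
    }

    // Hex value of each byte, masked so negative bytes print as 80..FF.
    public static String bytesToHex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < b.length; i++) {
            if (i > 0) sb.append(' ');
            String h = Integer.toHexString(b[i] & 0xFF).toUpperCase();
            if (h.length() < 2) h = "0" + h;
            sb.append(h);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumed stand-in for the value returned by File.list().
        String s = args.length > 0 ? args[0] : "A\u0419C";
        byte[] encoded = s.getBytes("windows-1251");
        String decoded = new String(encoded, "windows-1251");
        System.out.println("Original chars: " + charsToHex(s));
        System.out.println("Encoded bytes:  " + bytesToHex(encoded));
        System.out.println("Decoded chars:  " + charsToHex(decoded));
        System.out.println("Equal: " + s.equals(decoded));
    }
}
```

With the precomposed stand-in, the encoded bytes print as 41 C9 43, and C9 is the same byte that shows up as -55 in a signed decimal dump. If the original chars print as four code units instead of three, the filename is not in the form the encoder expects.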

I would personally be very suspicious of what File.list() returns for
filenames, because it's well-known that the HFS+ storage format is Apple's
decomposed Unicode. Search the list archives for details.

The string you see may very well show up exactly as you think it should
when you display it in the debugger, but all that means is that the
rendering you see is what you expect. It may have a very different binary
pattern from what you expect, and that specific binary pattern may not
directly translate into Windows-1251 without an intervening step.
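
As a rough sketch of what such an intervening step could look like: the code below composes the string to NFC before encoding. It assumes a Java 6 or later runtime, since java.text.Normalizer is not part of the 1.4.2 or 1.5 releases discussed in this thread. It also assumes the filename arrives as U+0418 followed by the combining breve U+0306, which has not been confirmed here. The class name ComposeBeforeEncode and the method roundTrip are invented for the example.

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

public class ComposeBeforeEncode {
    // Encode to the named charset and decode again, optionally composing to NFC first.
    public static String roundTrip(String s, String charsetName, boolean composeFirst) {
        Charset cs = Charset.forName(charsetName);
        String input = composeFirst ? Normalizer.normalize(s, Normalizer.Form.NFC) : s;
        // getBytes(Charset) substitutes the charset's replacement byte for unmappable chars.
        return new String(input.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Assumed stand-in for the HFS+ filename: U+0418 plus combining breve U+0306.
        String decomposed = "A\u0418\u0306C";
        String asIs = roundTrip(decomposed, "windows-1251", false);
        String composed = roundTrip(decomposed, "windows-1251", true);
        System.out.println(asIs.equals("A\u0418?C"));     // true: the breve became '?'
        System.out.println(composed.equals("A\u0419C"));  // true: one precomposed letter
        System.out.println(composed.equals(decomposed));  // false: result is in composed form
    }
}
```

Note that even with the composing step, the round-tripped string comes back in composed form, so it only equals the original if the original is normalized the same way before comparing.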

For example, the common Windows encoding CP1252 has codes for many
precomposed accented Roman letters. However, if you were to File.list() a
filename with an acute-E, it would come back from HFS+ as capital-E
followed by a combining acute accent. That sequence is not directly
representable in CP1252 as-is. The combining accent must first be composed
with the E into an acute-E, and then that single code can be translated to
CP1252. However, when rendered in decomposed Unicode form, the sequence
capital-E combining-acute renders exactly the same as the composed Unicode
form acute-E. In other words, you see one thing and assume it's what you
expected, but there are at least two very different binary patterns that
render to the exact same visible pixels. Welcome to Unicode.
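
The acute-E case can be checked directly. This is a minimal sketch, again assuming Java 6 or later for java.text.Normalizer; the class name SamePixels and the helper canonicallyEqual are hypothetical names chosen for the illustration.

```java
import java.text.Normalizer;

public class SamePixels {
    // True when both strings are equal once each is composed to NFC.
    public static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00C9";     // single code: capital E with acute
        String decomposed = "E\u0301";  // capital E followed by combining acute accent
        System.out.println(composed.length());                       // 1
        System.out.println(decomposed.length());                     // 2
        System.out.println(composed.equals(decomposed));             // false
        System.out.println(canonicallyEqual(composed, decomposed));  // true
    }
}
```

String.equals() compares the stored chars one by one, so two strings that draw the same glyph can still compare unequal.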

> When I run this "identical" code on my XP machine with Java 1.5, it works
> correctly. (I have to alter the path...but that's it.)

Exactly what binary patterns is XP returning from File.list()? How do they
compare to the binary patterns for the supposedly equivalent filename on
Mac OS X's File.list()?

And which Java version is malfunctioning on Mac OS X? Given no other info,
I would assume 1.5, except that a recent post of yours complains about a
bug with 1.5 and implies that you've reverted to 1.4.2, so I can't tell
what your reference is or what context you found this problem under. For
example, it could be that you discovered this Unicode problem when testing
under 1.4, but that's just a guess. Unless you tell us what the context
is, all anyone can do is guess.

In particular, 1.5's encoders may well be smarter than 1.4's, because 1.5
knows more about combining accents being "attached" to predecessor chars.
1.4 is pretty stupid about certain crucial aspects of Unicode, and that
stupidity only got alleviated in 1.5.

> It does this with other characters too...this is just an example.
> Characters that should be supported are not supported and get scrambled on
> the encode, decode operation.

You'll have to be specific about what fails, and exactly what "scrambled"
means.

-- GG