Mailing Lists: Apple Mailing Lists


Re: Bug with Text Encoding?




On Apr 20, 2007, at 8:59 PM, Greg Guerin wrote:

If you want to convert \u0438\u0306 into Windows-1251, then the first thing
you must do is normalize it back into its canonical composed form. There
are some links to Unicode normalizers here:
<http://lists.apple.com/archives/java-dev/2007/Apr/msg00111.html>


Off the top of my head, I don't know if they work with the Cyrillic
alphabet, but it's worth a try.  You could probably find more info by
googling for java unicode normalizer.
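
The normalize-then-convert step described above can be sketched roughly as follows. This is only an illustration under an assumption: it uses java.text.Normalizer, which exists in Java 6 and later but not in the 1.5.0_07 runtime used further down in this message, where a third-party normalizer would be needed instead. The class and method names (ComposeThenEncode, toWindows1251, hex) are made up for the example.

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

class ComposeThenEncode {

    // Assumption: the JVM provides the windows-1251 charset.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Compose combining sequences (NFC) first, then encode.
    static byte[] toWindows1251(String s) {
        String composed = Normalizer.normalize(s, Normalizer.Form.NFC);
        return composed.getBytes(CP1251);
    }

    // Hex dump, so the result does not depend on the console encoding.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+0438 followed by the combining breve U+0306.
        String decomposed = "A\u0438\u0306C";
        // Without normalizing, the breve has no Windows-1251 byte
        // and is replaced by '?': prints 41 E8 3F 43
        System.out.println(hex(decomposed.getBytes(CP1251)));
        // After NFC the pair becomes U+0439: prints 41 E9 43
        System.out.println(hex(toWindows1251(decomposed)));
    }
}
```

Printing the bytes as hex rather than as text sidesteps the display problems discussed below.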

OK, I got a little curious about this.
It does seem a little tricky to see correct results here with Java on OS X. You never seem to see them correctly from Terminal, period. You can get them right, sometimes, from Java text components.
First, you need something other than English listed first in the International system preferences to get the 'ru' locale.
I chose русский, but other Cyrillic-looking choices might also work.
(I don't know how the mail system will handle some of these characters; my apologies if they don't get through correctly.)
Then I set up my test case like this:
public class TestEnc {

    public static void main(String[] args) {
        // Report the runtime version and the locale settings in effect.
        System.out.println(System.getProperty("java.version"));
        System.out.println(System.getProperty("user.language"));
        System.out.println(System.getProperty("user.country"));
        System.out.println(java.util.Locale.getDefault().getLanguage());

        // tstr1: U+0438 plus combining breve U+0306 (decomposed form)
        // tstr2: U+0438 alone
        // tstr3: U+0439 (precomposed form)
        String tstr1 = "A\u0438\u0306C";
        String tstr2 = "A\u0438C";
        String tstr3 = "A\u0439C";

        try {
            // getBytes() with no argument encodes with the platform default
            // encoding; each line then decodes those bytes a different way.
            System.out.println(new String(tstr1.getBytes()));
            System.out.println(new String(tstr1.getBytes(), "ISO-8859-1"));
            System.out.println(new String(tstr1.getBytes(), "MacRoman"));
            System.out.println(new String(tstr1.getBytes(), "Windows-1251"));
            System.out.println("*******");
            System.out.println(new String(tstr2.getBytes()));
            System.out.println(new String(tstr2.getBytes(), "ISO-8859-1"));
            System.out.println(new String(tstr2.getBytes(), "MacRoman"));
            System.out.println(new String(tstr2.getBytes(), "Windows-1251"));
            System.out.println("*******");
            System.out.println(new String(tstr3.getBytes()));
            System.out.println(new String(tstr3.getBytes(), "ISO-8859-1"));
            System.out.println(new String(tstr3.getBytes(), "MacRoman"));
            System.out.println(new String(tstr3.getBytes(), "Windows-1251"));
        } catch (java.io.UnsupportedEncodingException uee) {
            uee.printStackTrace();
        }
    }
}


Displayed in a text component with the correct locale set, this gives:

1.5.0_07
ru
RU
ru
Aи?C
Aè?C
AË?C
Aи?C
*******
AиC
AèC
AËC
AиC
*******
AйC
AéC
AÈC
AйC

I'm not sure "MacRoman" is actually the correct encoding name. It's sort of odd that the default encoding seems to work correctly, and differently from the hard-coded MacRoman. I might be misremembering where the dashes go in "ISO-8859-1" too, for that matter.
It also seems a little strange that different Latin character sets would give different results in the Cyrillic Unicode range, if I do have the names right.
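
One possible reading of the output above, which nothing in this thread confirms: getBytes() with no argument encodes with the platform default encoding, presumably a Cyrillic one under the 'ru' locale, and each named encoding then maps those same byte values to its own characters. The test ran without an UnsupportedEncodingException, so the names were at least accepted. Here is a small sketch of the effect, with Windows-1251 standing in for the default encoding (an assumption) and made-up names (SameBytesDifferentCharsets, recode):

```java
import java.nio.charset.Charset;

class SameBytesDifferentCharsets {

    // Encode with one charset, then decode the same bytes with another.
    static String recode(String s, String encodeWith, String decodeWith) {
        byte[] bytes = s.getBytes(Charset.forName(encodeWith));
        return new String(bytes, Charset.forName(decodeWith));
    }

    public static void main(String[] args) {
        String text = "A\u0438C";
        // windows-1251 stands in for the platform default (assumption).
        // Matching charsets: the round trip returns the original text.
        System.out.println(text.equals(recode(text, "windows-1251", "windows-1251")));
        // Mismatched charsets: byte 0xE8 decodes to U+00E8 in ISO-8859-1,
        // a Latin e with grave accent instead of the Cyrillic letter.
        System.out.println("A\u00E8C".equals(recode(text, "windows-1251", "ISO-8859-1")));
    }
}
```

Both lines should print true. On that reading, the Latin encodings are not disagreeing about the Cyrillic range at all; they are just assigning different characters to the same byte.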


But the OP might want to keep in mind he needs the international preferences set right and it's best to avoid Terminal for testing.
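
If Terminal can't be trusted to display the characters, one workaround is to print the string as escapes instead of raw text, so the output is plain ASCII whatever the console does. A minimal sketch, with made-up names (CodePointDump, escape):

```java
class CodePointDump {

    // Render each non-ASCII char as a backslash-u escape with four hex
    // digits; ASCII chars are copied through unchanged.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The decomposed test string from the test case above.
        System.out.println(escape("A\u0438\u0306C"));
        // The default encoding the JVM picked up, for reference.
        System.out.println(System.getProperty("file.encoding"));
    }
}
```

The first line shows exactly which code units the string holds, which makes it easy to tell a decomposed string from a composed one without relying on how it renders.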

Mike Hall        mikehall at spacestar dot net
http://www.spacestar.net/users/mikehall
http://sourceforge.net/projects/macnative





