Mailing Lists: Apple Mailing Lists


Re: Mac character encoding and getInputStream()



You may also want to check whether the stripped-down normalizer helps:

http://sourceforge.net/project/showfiles.php?group_id=173651&package_id=203870&release_id=448310
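
To illustrate what a normalizer does with such a filename, here is a minimal sketch. It assumes Java 6 or later and uses the JDK's own java.text.Normalizer rather than the stripped-down normalizer linked above, whose API is not shown in this thread. The class and method names (ComposeExample, compose) are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of the linked normalizer package.
final class ComposeExample {
    /** Recomposes base letter + combining mark sequences into precomposed characters (NFC). */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'a' followed by COMBINING RING ABOVE, as a decomposed filename would contain it.
        String decomposed = "a\u030A";
        String composed = compose(decomposed);

        System.out.println("decomposed length: " + decomposed.length());
        System.out.println("composed length:   " + composed.length());
        System.out.println("equals precomposed a-ring: " + composed.equals("\u00E5"));
    }
}
```

Since HFS+ does not follow Unicode's canonical decomposition exactly, this should be treated as a starting point for ordinary accented letters, not as a guaranteed round trip for every filename.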

On 27.01.2007 at 19:07, Greg Guerin wrote:

Martin Edling Andersson wrote:

I've tried both UTF-8 and UTF-16. None gives me the correct encoding.

and earlier:

The character <small a ring above> becomes a\xcc\x8a and the character
<small a diaeresis> becomes o\xcc\x8 and so forth.

I think you may be seeing the filenames in their raw on-disk format, which
is Unicode with combining accents. This doesn't explain everything you
cited, but it would explain a lot. This is also called "decomposed
canonical" format**.


The encoding for this would be UTF-8, but you'd end up with a combining accent
mark following the base letter of every accented character. The common
combining marks are in the Combining Diacritical Marks block, U+0300 through
U+036F, so I suggest decoding with UTF-8 and dumping the char data as hex.
Either that, or read the raw data and dump the bytes as hex, so we can confirm
it's UTF-8 with decomposed accents.
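
The check suggested above could be sketched roughly as follows. The sample bytes are the a\xcc\x8a sequence quoted earlier. The class and method names (CombiningMarkDump, toHex, hasCombiningMark) are hypothetical, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper for inspecting decoded filenames.
final class CombiningMarkDump {
    /** Formats each UTF-16 code unit of s as four uppercase hex digits, space separated. */
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    /** True if s contains a char in the Combining Diacritical Marks block (U+0300 to U+036F). */
    static boolean hasCombiningMark(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0300 && c <= 0x036F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Bytes as reported in the thread: 'a' followed by 0xCC 0x8A.
        byte[] raw = { 0x61, (byte) 0xCC, (byte) 0x8A };
        String decoded = new String(raw, StandardCharsets.UTF_8);

        System.out.println(toHex(decoded));            // prints "0061 030A"
        System.out.println(hasCombiningMark(decoded)); // prints "true"
    }
}
```

If the dump shows a plain base letter followed by a value in the 03xx range, the name is UTF-8 with decomposed accents; a single value such as 00E5 would mean it is already precomposed.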


** The disk data for HFS+ doesn't exactly follow Unicode's canonical
decomposition.  See HFS+ links below.

Here's my complete list of frequently pasted URLs on this subject.

## Unicode: Composing Accents, Normalization, HFS+ ##
Canonical Equivalence in Applications:
  <http://www.unicode.org/notes/tn5/>
UAX #15: Unicode Normalization:
  <http://www.unicode.org/reports/tr15/>

International Components for Unicode (see: ICU4J) -- J2SE 1.4+
  <http://icu.sourceforge.net/>
ICU's normalization summary:
  <http://icu.sourceforge.net/userguide/normalization.html>

A smaller Normalizer (uses ICU and IBM parts):
<http://sourceforge.net/project/showfiles.php?group_id=173651&package_id=203870>
...and its historical context (see entire message thread)...
<http://lists.apple.com/archives/java-dev/2006/Sep/msg00169.html>


Mac OS X Normalization:
<http://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/Articles/FileEncodings.html>
<http://developer.apple.com/qa/qa2001/qa1235.html>


QA1235: Converting to Precomposed Unicode -- native calls, not Java
  <http://developer.apple.com/qa/qa2001/qa1235.html>

HFS+/HFSX text-encoding, TN2078 & TN1150:
  <http://developer.apple.com/technotes/tn2002/tn2078.html#HowEncoded>
  <http://developer.apple.com/technotes/tn/tn1150.html>
  <http://developer.apple.com/technotes/tn/tn1150table.html> -- table

  -- GG


_______________________________________________ Do not post admin requests to the list. They will be ignored. Java-dev mailing list (email@hidden) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/java-dev/email@hidden

This email sent to email@hidden

_______________________________________________ Do not post admin requests to the list. They will be ignored. Java-dev mailing list (email@hidden) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/java-dev/email@hidden

This email sent to email@hidden