Mailing Lists: Apple Mailing Lists


Re: Need help with character encoding



Kevin Hoyt wrote:

>However, the representation returned by FSGetCatalogInfoBulk shows the
>character as: the letter "A (0x0041) Combining Diaeresis (0x0308) and
>Combining Macron (0x0304)"

That is consistent with HFS+ behavior and the FS* API.

TN1150 HFS+/HFSX format -- find "Canonical Decomposition" on that page:
  <http://developer.apple.com/technotes/tn/tn1150.html>
  <http://developer.apple.com/technotes/tn/tn1150table.html> -- table


>When I send "A (0x0041) Combining Diaeresis (0x0308) and Combining Macron
>(0x0304)" to the Java app, I see the A followed by 2 squares.

Seen how?  What are you using to present or render the chars?

Some text components handle combining marks less well than others.  Also,
it may depend on whether the font has glyphs for them, as I vaguely recall.


>So what I think needs to happen is to convert the  "A (0x0041) Combining
>Diaeresis (0x0308) and Combining Macron (0x0304)" to the "A with diaeresis
>(0x00C4) and Combining Macron (0x0304)".

First, I suggest confirming that the precomposed "A with Diaeresis" followed
by "Combining Macron" actually renders correctly.  You can do that pretty
easily by using a String literal in a test program and then rendering the
String.  Use Unicode escapes in the literal:
  String combined = "\u00c4\u0304";

I would test other combinations, too.
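
A minimal sketch of such a test program follows.  The class name and the
describe() helper are made up for illustration, and the extra samples (an
"e" with acute accent in both forms) are just one possible choice.  It only
prints to the console, so to test rendering you would hand the same Strings
to whatever component you actually use for display.

```java
class CombiningRenderCheck {
    // Lists each UTF-16 code unit of s as U+XXXX, separated by spaces,
    // so you can confirm exactly which chars are in the String.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "A\u0308\u0304"; // what the FS* API returned
        String combined = "\u00c4\u0304";    // the form proposed above
        // Other combinations worth trying (illustrative choices).
        String[] samples = { decomposed, combined, "e\u0301", "\u00e9" };
        for (String s : samples) {
            // Replace println with your own rendering code as needed.
            System.out.println(describe(s) + " -> " + s);
        }
    }
}
```

If the decomposed and precomposed samples render differently in the same
component and font, that points at the renderer rather than at the data.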


> It's my belief this would be a
>conversion from decomposed to composed Unicode.

That is indeed the conversion.


>For completeness, on the Java side, we create a String from a byte buffer.
> Perhaps we need to tell the Java code how the bytes are encoded?  If so,
>what encoding should we use for the data we get from the Carbon API?

All the chars you're working with are already Unicode, which means the
problem has nothing to do with the character ENCODING, as the term is used
in Java, but with the composition or canonicalization of the Unicode
code-points.  There's no encoding I know of that performs both a charset
translation AND a combining-accent composition.  The issues are orthogonal.
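
To illustrate the point, here is a small sketch.  It ASSUMES the bytes
arrive as big-endian UTF-16, which is only a guess for the example; the
actual byte order depends on how the buffer gets from the Carbon side to
Java.  The class and method names are hypothetical, and StandardCharsets
needs Java 7 or later.  Decoding yields the same three decomposed chars no
matter which Unicode encoding carried them.

```java
import java.nio.charset.StandardCharsets;

class EncodingVsComposition {
    // Decodes raw bytes as big-endian UTF-16 (an assumption for this sketch).
    static String decodeUtf16BE(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // A (0x0041), Combining Diaeresis (0x0308), Combining Macron (0x0304)
        byte[] raw = { 0x00, 0x41, 0x03, 0x08, 0x03, 0x04 };
        String fromUtf16 = decodeUtf16BE(raw);

        // Round-trip the same text through a different encoding.
        byte[] utf8 = fromUtf16.getBytes(StandardCharsets.UTF_8);
        String fromUtf8 = new String(utf8, StandardCharsets.UTF_8);

        // The byte counts differ, but the chars are identical and still
        // decomposed; no charset turned them into the precomposed form.
        System.out.println(raw.length + " bytes vs " + utf8.length + " bytes");
        System.out.println(fromUtf16.equals(fromUtf8));
        System.out.println(fromUtf16.equals("\u00c4\u0304"));
    }
}
```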


>However, I do not see a way to do this...

Agreed.  Nothing I know of in J2SE transforms Unicode compositions.  A
significant omission, IMNSHO.

When I solved the problem, I wrote my own class: AccentComposer.  It's in
my open source MacBinary Toolkit for Java:
 <http://www.amug.org/~glguerin/sw/#macbinary>

It's moderately configurable, but it won't scale up too far because of its
relatively simple implementation.
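
To give a feel for the general idea, here is a hypothetical table-driven
sketch.  It is NOT the AccentComposer source and makes no claim about how
that class works; the class name, the addPair() method, and the greedy
left-to-right strategy are all assumptions made for this example.  It
handles only BMP chars and ignores the combining-class rules of real
canonical composition.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the MacBinary Toolkit's AccentComposer.
class SimplePairComposer {
    // Maps a (preceding char, combining char) pair to a single composed char.
    private final Map<Integer, Character> pairs = new HashMap<Integer, Character>();

    // Registers one composition, e.g. 'A' + U+0308 -> U+00C4.
    void addPair(char base, char combining, char composed) {
        pairs.put(key(base, combining), composed);
    }

    // Packs two UTF-16 code units into one int for use as a map key.
    private static int key(char base, char combining) {
        return (base << 16) | combining;
    }

    // Greedy left-to-right pass: whenever the last output char and the
    // next input char form a registered pair, replace the last output
    // char with the composed char; otherwise copy the input char through.
    String compose(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int last = out.length() - 1;
            Character composed =
                (last >= 0) ? pairs.get(key(out.charAt(last), c)) : null;
            if (composed != null) {
                out.setCharAt(last, composed);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With only the pair ('A', U+0308) -> U+00C4 registered, composing the
sequence from the original question gives U+00C4 followed by U+0304.  The
table has to list every pair you care about, which is one reason an
approach like this does not scale far.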

There may be some other Java class that performs the canonical Unicode
compositions, so a web-search might be worth doing.
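
One such class, ASSUMING you can run on Java SE 6 or later, is
java.text.Normalizer, which was added to the standard library in that
release.  The sketch below uses it to convert between the decomposed (NFD)
and composed (NFC) forms; the wrapper class and method names are made up
for the example.  Note that NFC composes as far as Unicode allows, so the
three-char sequence from the original question may compose further than the
two-char form discussed above (Unicode defines U+01DE, LATIN CAPITAL LETTER
A WITH DIAERESIS AND MACRON).  Print the result to see what your runtime
produces.

```java
import java.text.Normalizer;

class ComposeWithNormalizer {
    // Decomposed -> composed (Unicode Normalization Form C).
    static String toComposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Composed -> decomposed (Unicode Normalization Form D).
    static String toDecomposed(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String fromFileSystem = "A\u0308\u0304"; // as returned by the FS* API
        String composed = toComposed(fromFileSystem);
        // Dump the resulting chars in hex to see what was composed.
        for (int i = 0; i < composed.length(); i++) {
            System.out.println(Integer.toHexString(composed.charAt(i)));
        }
        // Going back to NFD should reproduce the file-system form.
        System.out.println(toDecomposed(composed).equals(fromFileSystem));
    }
}
```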


Here's my list of Frequently Pasted URLs on the subject of Unicode
composition and forms.

UTF-16 for Processing:
  <http://www.unicode.org/notes/tn12/>

Minimum Knowledge on Charsets:
  <http://www.joelonsoftware.com/articles/Unicode.html>

Forms of Unicode (written in 1999, but still good):
  <http://www.ibm.com/developerworks/library/utfencodingforms/>

  -- GG


 _______________________________________________
Java-dev mailing list      (email@hidden)




Copyright © 2007 Apple Inc. All rights reserved.