All,

IMHO, this is a bug in the VM, because it forces you to decompose the
Unicode characters yourself. It is nice to know that the filesystem
stores file names in decomposed form, but why in the world did that
creep up into the semantic contract of the File class? In the Unicode
standard, the composed and decomposed forms of a character are
canonically equivalent and are intended to carry the same meaning.
While there may be nothing that states that both should work
equivalently within a Unicode application (maybe there is, I am not
sure), it makes life difficult to have to know which form is required.
If Apple chooses to require the decomposed form in the C APIs, fine,
but have the Java VM do the decomposition so that the WORA (write once,
run anywhere) contract of the core Java File class is maintained.
Logically, if the non-UI File class becomes semantically
platform-specific, the entire WORA model of the VM is compromised!
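
To make the composed/decomposed distinction concrete, here is a minimal sketch of the kind of decomposition the VM could do on the caller's behalf. It assumes java.text.Normalizer, which only arrived in Java 6 and so is not available on the 1.5 VM discussed here. The class and method names (DecomposedPaths, toDecomposed, fileFor) are made up for illustration, and the filesystem's own decomposition may not match strict NFD in every case.

```java
import java.io.File;
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of the attached program.
class DecomposedPaths {
    // Returns the path in canonical decomposed form (NFD).
    static String toDecomposed(String path) {
        return Normalizer.normalize(path, Normalizer.Form.NFD);
    }

    // Builds a File from the decomposed path, so callers can pass either form.
    static File fileFor(String path) {
        return new File(toDecomposed(path));
    }

    public static void main(String[] args) {
        String composed = "caf\u00E9";     // e-acute as a single code point
        String decomposed = "cafe\u0301";  // 'e' followed by a combining acute accent
        // The two forms are canonically equivalent but are different Strings.
        System.out.println(composed.equals(decomposed));               // false
        System.out.println(toDecomposed(composed).equals(decomposed)); // true
    }
}
```

The point of the sketch is only that the two spellings of the same name compare unequal until one of them is normalized, which is why the caller currently has to know which form the File class expects.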

As a bit of history, non-decomposed characters used to work (better) in
1.3 and 1.4, but with 1.5 all *#$!% broke loose. Remote-mounted volumes
with extended-character file names had their own restrictions under each
of the previous VM versions. We filed bug report 4486240 against 1.5
about having to decompose names on remote-mounted volumes, but the more
I play with it (it could be related to update 4), the more this
restriction keeps popping up in cases that have nothing to do with
remote-mounted volumes. I hold almost no hope that this bug will ever be
changed from "open".

Larry, the simple workaround is to do the decomposition yourself. If
you download the Unicode data table
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), one of the
columns (the sixth, I believe, i.e. field 5 counting from zero) holds
the equivalent decomposed form. If you write a simple program to parse
the file (see the attached Java file), you can output a conversion map
with one line per character, each of the form
\u00C0=\u0041\u0300
(see the attachment for the full list). I have attached the full mapping
file produced by the attached program. A separate class that performs
the mapping loads this decomposed resource file. Hope this helps you.
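
The mapping class itself is not attached, so the following is only a guess at how such a class could look. It assumes the mapping text is read with java.util.Properties, which expands the \uXXXX escapes on both sides of the '=' (the Reader overload of load() needs Java 6). The class name AccentDecomposerSketch and its methods are hypothetical, and it handles only the 16-bit characters that the generator emits.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical mapping class; a sketch, not the author's implementation.
class AccentDecomposerSketch {
    private final Properties mappings = new Properties();

    // mappingText holds lines in the generator's output format.
    // Properties.load() expands the backslash-u escapes, so each key ends up
    // as one composed character and each value as its decomposed sequence.
    AccentDecomposerSketch(String mappingText) {
        try {
            mappings.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new IllegalArgumentException("Unreadable mapping text", e);
        }
    }

    // Replaces every character that has a mapping; all others pass through.
    String decompose(String input) {
        StringBuilder result = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            String key = String.valueOf(input.charAt(i));
            result.append(mappings.getProperty(key, key));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Two sample lines: e-acute and Greek alpha with tonos.
        String sample = "\\u00E9=\\u0065\\u0301\n\\u03AC=\\u03B1\\u0301\n";
        AccentDecomposerSketch decomposer = new AccentDecomposerSketch(sample);
        System.out.println(decomposer.decompose("caf\u00E9").equals("cafe\u0301")); // true
    }
}
```

In real use the text would come from the attached mapping file rather than an inline string, and the decomposed result would be what gets handed to the File constructor.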
Regards,
Mike
Larry Nussbaum wrote:
I've had the same problem with Greek characters, inconsistent
decoding... Does anyone know of a good decoding routine?

/**
 * Copyright 2001-2006 WAVE Corporation
 * All Rights Reserved.
 */
package com.wavecorp.generation;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * @author Michael Macaluso
 */
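public class GenerateUnicodeAccentDecomposerData
{
    // Escape prefixes indexed by the number of hex digits (1 to 4) in the
    // character value; index 0 is never used.
    public static final String UNICODE_PREPEND_SEQUENCE[];
    static
    {
        UNICODE_PREPEND_SEQUENCE = new String[]
        {
            "\\u0000",
            "\\u000",
            "\\u00",
            "\\u0",
            "\\u",
        };
    }

    // Formats a 16-bit character as a backslash-u escape with four hex digits.
    public static String GetCharacterAsUnicodeSequence(int aChar)
    {
        String anIntString = Integer.toHexString(aChar).toUpperCase();
        if (anIntString.length() > 4)
        {
            // Fail with a clear message instead of an array index error.
            throw new IllegalArgumentException("Character does not fit in four hex digits: " + anIntString);
        }
        return UNICODE_PREPEND_SEQUENCE[anIntString.length()] + anIntString;
    }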

    public static class UnicodeData
    {
        public String m_CharacterAsHexString; // 0
        public Integer m_Character; // 0
        public String m_Name; // 1
        public String m_GeneralCategory; // 2
        public Integer m_CanonicalCombiningClass; // 3
        public String m_BidiClass; // 4
        public String m_DecompositionType; // 5
        public Integer[] m_DecompositionMapping; // 5
        public String m_NumericValueDecimalDigit; // 6
        public String m_NumericValueDigit; // 7
        public String m_NumericValueNumeric; // 8
        public Boolean m_BidiMirrored; // 9
        public String m_Unicode1Name; // 10
        public String m_ISOComment; // 11
        public Integer m_SimpleUppercaseMapping; // 12
        public Integer m_SimpleLowercaseMapping; // 13
        public Integer m_SimpleTitlecaseMapping; // 14
    }

    public GenerateUnicodeAccentDecomposerData()
    {
    }
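
    // Parses a hex code point; returns null for a null, empty, or malformed string.
    public static Integer GetCharacterFromHexString(String aHexString)
    {
        if (null == aHexString || aHexString.length() == 0)
        {
            return null;
        }
        try
        {
            return Integer.valueOf(aHexString, 16);
        }
        catch (Exception e)
        {
            e.printStackTrace();
            return null;
        }
    }

    // Appends the parsed code point to the list; unparsable input is skipped.
    public static void AddCharacterFromHexStringToList(String aHexString, List aList)
    {
        Integer aCharacter = GetCharacterFromHexString(aHexString);
        if (null != aCharacter)
        {
            aList.add(aCharacter);
        }
    }

    // Returns aString from begin up to (not including) end, or everything
    // from begin onward when end is -1 (no further delimiter was found).
    // An empty field between two delimiters comes back as the empty string.
    public static String GetSubString(String aString, int begin, int end)
    {
        if (-1 == end)
        {
            if (begin == aString.length())
            {
                return null;
            }
            return aString.substring(begin);
        }
        // Guards the begin == end + 1 case, which substring() would reject.
        if (begin == end + 1)
        {
            return null;
        }
        return aString.substring(begin, end);
    }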

    public static Map ReadUnicodeData(String aUnicodeFileName)
    {
        Map aReturnMap = new TreeMap();
        FileReader aFileReader = null;
        try
        {
            aFileReader = new FileReader(aUnicodeFileName);
            BufferedReader in = new BufferedReader(aFileReader);
            String aLine;
            while (null != (aLine = in.readLine()))
            {
                UnicodeData aUnicodeData = new UnicodeData();
                int begin = 0;
                int end = -1;
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_CharacterAsHexString = GetSubString(aLine, begin, end);
                aUnicodeData.m_Character = GetCharacterFromHexString(aUnicodeData.m_CharacterAsHexString);
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_Name = GetSubString(aLine, begin, end);
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_GeneralCategory = GetSubString(aLine, begin, end);
                end = aLine.indexOf(';', (begin = end + 1));
                String aCanonicalCombiningClassString = GetSubString(aLine, begin, end);
                try
                {
                    aUnicodeData.m_CanonicalCombiningClass = Integer.valueOf(aCanonicalCombiningClassString);
                }
                catch (Exception e)
                {
                    e.printStackTrace();
                }
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_BidiClass = GetSubString(aLine, begin, end);
                end = aLine.indexOf(';', (begin = end + 1));
                String aDecomposition = GetSubString(aLine, begin, end);
                if (aDecomposition.length() > 0)
                {
                    int startOfDecomposition = 0;
                    if (aDecomposition.charAt(0) == '<')
                    {
                        int endOfType = aDecomposition.indexOf('>');
                        startOfDecomposition = 1;
                        aUnicodeData.m_DecompositionType = GetSubString(aDecomposition, startOfDecomposition, endOfType);
                        startOfDecomposition = endOfType + 2;
                    }
                    List aDecompositionList = new ArrayList();
                    int indexOfSpace = aDecomposition.indexOf(' ', startOfDecomposition);
                    while (true)
                    {
                        try
                        {
                            String aDecompositionCharacterInHex = GetSubString(aDecomposition, startOfDecomposition, indexOfSpace);
                            AddCharacterFromHexStringToList(aDecompositionCharacterInHex, aDecompositionList);
                        }
                        catch (IndexOutOfBoundsException e)
                        {
                            e.printStackTrace();
                            break;
                        }
                        if (-1 == indexOfSpace)
                        {
                            break;
                        }
                        startOfDecomposition = indexOfSpace + 1;
                        indexOfSpace = aDecomposition.indexOf(' ', startOfDecomposition);
                    }
                    aUnicodeData.m_DecompositionMapping = (Integer[]) aDecompositionList.toArray(new Integer[aDecompositionList.size()]);
                }
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_NumericValueDecimalDigit = GetSubString(aLine, begin, end);
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_NumericValueDigit = GetSubString(aLine, begin, end);
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_NumericValueNumeric = GetSubString(aLine, begin, end);
                end = aLine.indexOf(';', (begin = end + 1));
                String aBidiMirroredFlag = GetSubString(aLine, begin, end);
                aUnicodeData.m_BidiMirrored = aBidiMirroredFlag.equals("Y") ? Boolean.TRUE : Boolean.FALSE;
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_Unicode1Name = GetSubString(aLine, begin, end);
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_ISOComment = GetSubString(aLine, begin, end);
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_SimpleUppercaseMapping = GetCharacterFromHexString(GetSubString(aLine, begin, end));
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_SimpleLowercaseMapping = GetCharacterFromHexString(GetSubString(aLine, begin, end));
                end = aLine.indexOf(';', (begin = end + 1));
                aUnicodeData.m_SimpleTitlecaseMapping = GetCharacterFromHexString(GetSubString(aLine, begin, end));
                aReturnMap.put(aUnicodeData.m_Character, aUnicodeData);
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        finally
        {
            if (null != aFileReader) { try { aFileReader.close(); } catch (Exception e) {} }
        }
        return aReturnMap;
    }

    public static void RecursivelyAddCanonicallyDecomposedCharacters(UnicodeData aUnicodeData, Map aLookupMap, List aCanonicalDecompositionCharacterList) throws Exception
    {
        Integer[] aDecompositionMapping = aUnicodeData.m_DecompositionMapping;
        // A character with no mapping, or with only a compatibility mapping
        // (one tagged with a decomposition type such as "compat"), is a leaf
        // of the canonical decomposition and is emitted unchanged. Without
        // the type check, U+0385 (00A8 0301) would expand through the
        // compatibility mapping of U+00A8 into 0020 0308 0301.
        if (null == aDecompositionMapping || null != aUnicodeData.m_DecompositionType)
        {
            aCanonicalDecompositionCharacterList.add(aUnicodeData.m_Character);
        }
        else
        {
            for (int i = 0; i < aDecompositionMapping.length; i++)
            {
                Integer aDecomposedCharacter = aDecompositionMapping[i];
                UnicodeData aDecomposedUnicodeData = (UnicodeData) aLookupMap.get(aDecomposedCharacter);
                if (null == aDecomposedUnicodeData)
                {
                    throw new Exception("Error resolving character: " + GetCharacterAsUnicodeSequence(aDecomposedCharacter.intValue()));
                }
                if (aDecomposedUnicodeData.m_CharacterAsHexString.length() > 4)
                {
                    throw new Exception("Resolve character is too long: " + aDecomposedUnicodeData.m_CharacterAsHexString);
                }
                RecursivelyAddCanonicallyDecomposedCharacters(aDecomposedUnicodeData, aLookupMap, aCanonicalDecompositionCharacterList);
            }
        }
    }

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args)
    {
        if (args.length == 0)
        {
            System.err.println("Please pass in the location of a Unicode 4.0 or greater text file as the first parameter");
            // Stop here; otherwise args[0] below throws ArrayIndexOutOfBoundsException.
            return;
        }
        Map aMap = ReadUnicodeData(args[0]);
        Map aLookupMap = (Map)((TreeMap) aMap).clone();
        Iterator anIterator = aMap.entrySet().iterator();
        while (anIterator.hasNext())
        {
            Map.Entry aMapEntry = (Map.Entry) anIterator.next();
            Integer aCharacter = (Integer) aMapEntry.getKey();
            UnicodeData aUnicodeData = (UnicodeData) aMapEntry.getValue();
            if (
                null != aUnicodeData.m_DecompositionType // We only want Canonical Decompositions
                || null == aUnicodeData.m_DecompositionMapping // That are not blank
                || aUnicodeData.m_CharacterAsHexString.length() > 4 // That we can represent as a Java character
            )
            {
                continue;
            }
            List aCanonicalDecompositionCharacterList = new LinkedList();
            try
            {
                RecursivelyAddCanonicallyDecomposedCharacters(aUnicodeData, aLookupMap, aCanonicalDecompositionCharacterList);
            }
            catch (Exception e)
            {
                // This error is raised when a character in a decomposition list is unable to be looked-up
                // Eat this now and ignore this mapping
                continue;
            }
            System.out.print(GetCharacterAsUnicodeSequence(aCharacter.intValue()));
            System.out.print('=');
            Iterator aCanonicalDecompositionCharacterListIterator = aCanonicalDecompositionCharacterList.iterator();
            while (aCanonicalDecompositionCharacterListIterator.hasNext())
            {
                Integer aDecomposedCharacter = (Integer) aCanonicalDecompositionCharacterListIterator.next();
                System.out.print(GetCharacterAsUnicodeSequence(aDecomposedCharacter.intValue()));
            }
            System.out.println();
        }
    }
}