Hi all. This seems like it should work, but it doesn't.
I truncate a string that may contain Japanese characters, purely for display purposes. Double-byte or multi-byte characters are split apart.
Results look like this:
お使いのコンピュータにDVDドライブが搭載れているかは�?
[...]
Here is the code:
public String stringWithNoHTML(String aStringWithHTML, int lengthTruncated) {
    String returnValue = null;
    if (aStringWithHTML != null && aStringWithHTML.length() > 0) {
        //StringBuffer textBlock = new StringBuffer(aStringWithHTML);
        StringBuffer textBlock = new StringBuffer();
        Pattern htmlTagPattern = Pattern.compile("<(.|\n|\r)+?>|&[a-zA-Z0-9]+;");
        Matcher lineBreakMatcher = htmlTagPattern.matcher(aStringWithHTML);
        boolean results = lineBreakMatcher.find();
        while (results)
        {
            lineBreakMatcher.appendReplacement(textBlock, " ");
            results = lineBreakMatcher.find();
        }
        lineBreakMatcher.appendTail(textBlock);
        if (lengthTruncated > 0 && textBlock.length() > SUMMARY_LENGTH) {
            try {
                returnValue = new String(textBlock.toString().getBytes("UTF-8"), 0, lengthTruncated, "UTF-8");
            } catch (UnsupportedEncodingException ex) {
                returnValue = null;
            }
            //returnValue = new String(textBlock.substring(0, lengthTruncated) + "...");
        } else
            returnValue = textBlock.toString();
    }
    return returnValue;
}
The original string may contain single-byte characters as well. I expect the string to be truncated on character boundaries, not in the middle of a character's bytes. It works fine with single-byte characters.
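To make the expected behavior concrete, here is a minimal sketch of truncation that counts Unicode code points instead of bytes, so a character is never cut in half. The class name SafeTruncate and the method truncate are hypothetical and not part of the code above; the sketch assumes the limit is meant as a number of characters, not bytes.

```java
class SafeTruncate {
    // Hypothetical helper: keeps at most maxCodePoints code points and
    // appends "..." only when something was actually cut off.
    static String truncate(String text, int maxCodePoints) {
        if (text == null || maxCodePoints <= 0) {
            return text; // nothing to truncate
        }
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text; // already short enough
        }
        // Translate the code point limit into a char index, so that a
        // surrogate pair is never split.
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end) + "...";
    }

    public static void main(String[] args) {
        // "DVD", four katakana characters, U+1F600 (a surrogate pair), "!"
        String sample = "DVD\u30C9\u30E9\u30A4\u30D6\uD83D\uDE00!";
        System.out.println(truncate(sample, 8));
    }
}
```

With a limit of 8, the sample keeps the three Latin letters, the four katakana characters, and the complete surrogate pair, then gets the "..." suffix.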
Using
returnValue = new String(textBlock.toString().getBytes("UTF-8"), 0, lengthTruncated, "UTF-8");
or
returnValue = new String(textBlock.substring(0, lengthTruncated) + "...");
makes no difference.
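For comparison, the sketch below runs both variants in isolation on a short sample ("DVD" followed by four katakana characters). The class TruncationComparison and its method names are hypothetical, and it uses StandardCharsets, which assumes Java 7 or later. Run on its own, slicing the UTF-8 bytes cuts through a three-byte character and decodes to the replacement character U+FFFD, while slicing chars does not. If both variants look identical in the real application, the damage may be happening somewhere else, for example when the result is written out, but that is only a guess.

```java
import java.nio.charset.StandardCharsets;

class TruncationComparison {
    // Mirrors the first variant: slice the UTF-8 bytes, then decode them.
    static String byBytes(String s, int n) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, Math.min(n, utf8.length), StandardCharsets.UTF_8);
    }

    // Mirrors the second variant: slice UTF-16 chars and append "...".
    static String byChars(String s, int n) {
        return s.substring(0, Math.min(n, s.length())) + "...";
    }

    public static void main(String[] args) {
        // "DVD" followed by four katakana characters (3 UTF-8 bytes each)
        String sample = "DVD\u30C9\u30E9\u30A4\u30D6";
        String fromBytes = byBytes(sample, 5); // 3 ASCII bytes + 2 of 3 bytes
        String fromChars = byChars(sample, 5); // 5 whole chars
        // true when decoding hit a malformed (cut) byte sequence
        System.out.println(fromBytes.indexOf('\uFFFD') >= 0);
        System.out.println(fromChars);
    }
}
```

Note that substring can still split a surrogate pair for characters outside the Basic Multilingual Plane, so it is not fully safe either.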
I also bypassed the regex pattern and still see the same problem.
Files, components, class, etc. are in UTF-8.