Re: Truncating UTF-8 Strings (Resolved)

  • Subject: Re: Truncating UTF-8 Strings (Resolved)
  • From: Klaus Berkling <email@hidden>
  • Date: Wed, 10 Jun 2009 08:33:23 -0700


On Jun 8, 2009, at 2:23 PM, Klaus Berkling wrote:

Hi all.  This seems like it should work, but it doesn't.

I truncate a string that may contain Japanese characters, purely for display purposes. Double-byte or multi-byte characters are split apart.

Results look like this: 

お使いのコンピュータにDVDドライブが搭載れているかは�?

[...]
Here is the code:

public String stringWithNoHTML(String aStringWithHTML, int lengthTruncated) {
    String returnValue = null;
    if (aStringWithHTML != null && aStringWithHTML.length() > 0) {

        //StringBuffer textBlock = new StringBuffer(aStringWithHTML);
        StringBuffer textBlock = new StringBuffer();

        Pattern htmlTagPattern = Pattern.compile("<(.|\n|\r)+?>|&[a-zA-Z0-9]+;");
        Matcher lineBreakMatcher = htmlTagPattern.matcher(aStringWithHTML);

        boolean results = lineBreakMatcher.find();
        while (results)
        {
            lineBreakMatcher.appendReplacement(textBlock, " ");
            results = lineBreakMatcher.find();
        }
        lineBreakMatcher.appendTail(textBlock);

        if (lengthTruncated > 0 && textBlock.length() > SUMMARY_LENGTH) {
            try {
                returnValue  = new String(textBlock.toString().getBytes("UTF-8"), 0, lengthTruncated, "UTF-8");
            } catch (UnsupportedEncodingException ex) {
                returnValue = null;
            }
            //returnValue = new String(textBlock.substring(0, lengthTruncated) + "...");
        } else
            returnValue = textBlock.toString();
    }
    return returnValue;
}

The original string may contain single-byte characters as well.  I expect the string to be truncated properly, without chopping characters apart mid-sequence. It works fine with single-byte characters.

Using
returnValue  = new String(textBlock.toString().getBytes("UTF-8"), 0, lengthTruncated, "UTF-8");
or
returnValue = new String(textBlock.substring(0, lengthTruncated) + "...");
makes no difference.
I also bypassed the regex pattern and still see the same problem.

Files, components, class, etc. are in UTF-8.
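
For readers of the archive, here is a minimal sketch of how a byte-limited decode like the getBytes("UTF-8") call above can produce a replacement character at the cut. The class name ByteTruncationDemo, the helper truncateByBytes, and the sample string are hypothetical and not part of the original code; the sketch assumes a Java version that provides java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative and not from the original post.
final class ByteTruncationDemo {

    // Decodes only the first maxBytes bytes of the UTF-8 encoding,
    // mirroring the byte-based truncation in the code above.
    static String truncateByBytes(String text, int maxBytes) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        int length = Math.max(0, Math.min(maxBytes, utf8.length));
        return new String(utf8, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "DVD" followed by four katakana characters, written as escapes
        // so the source file encoding does not matter.
        String sample = "DVD\u30C9\u30E9\u30A4\u30D6";

        // 3 ASCII bytes + 4 katakana at 3 bytes each = 15 bytes.
        System.out.println(sample.getBytes(StandardCharsets.UTF_8).length);

        // A 5-byte cut ends inside the first katakana character; the decoder
        // substitutes U+FFFD for the incomplete sequence.
        System.out.println(truncateByBytes(sample, 5));

        // A 6-byte cut happens to land on a character boundary.
        System.out.println(truncateByBytes(sample, 6));
    }
}
```

The point of the sketch is only that a byte count and a character count are different units: a length that is safe for ASCII text can land in the middle of a three-byte Japanese character.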


(For the archive)
After a chat with the Java people at WWDC, this code seems to truncate properly:

int correctLengthTrucated = lengthTruncated;
while (correctLengthTrucated > 0)
    //if ( Character.isWhitespace(textBlock.charAt(correctLengthTrucated)) )
    if ( Character.isLetter(textBlock.charAt(correctLengthTrucated)) )
        break;
    else
        correctLengthTrucated--;
returnValue = new String(textBlock.substring(0, correctLengthTrucated) + "...");
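
As a possible alternative to backing up to the nearest letter, truncation can also be expressed in Unicode code points, so that a surrogate pair is never cut in half. The sketch below is not the method suggested at WWDC; the class name CodePointTruncation and the helper truncate are hypothetical, and it assumes the limit is meant as a character count rather than a byte count.

```java
// Hypothetical helper class; not part of the original code.
final class CodePointTruncation {

    // Keeps at most maxCodePoints code points and appends "..." only
    // when something was actually cut off.
    static String truncate(String text, int maxCodePoints) {
        if (text == null || maxCodePoints < 0) {
            return text;
        }
        // Nothing to do if the text already fits.
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text;
        }
        // Translate the code point limit into a char index that never
        // falls between the two halves of a surrogate pair.
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end) + "...";
    }

    public static void main(String[] args) {
        // "DVD" plus four katakana characters, limited to 4 code points.
        System.out.println(truncate("DVD\u30C9\u30E9\u30A4\u30D6", 4));
        // A supplementary character (one code point, two chars) stays whole.
        System.out.println(truncate("a\uD83D\uDE00b", 2));
    }
}
```

Code points are still not the same as user-perceived characters: a base letter followed by a combining mark could be separated by this approach. If that matters, java.text.BreakIterator.getCharacterInstance() may be a better basis for choosing the cut position.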


Thanks to all who helped.

kib

"Success is not final, failure is not fatal: it is the courage to continue that counts."
Winston Churchill

Klaus Berkling
Systems Administrator
DynEd International, Inc.
www.dyned.com | www.eskimo.com/~kiberkli




