Truncating UTF-8 Strings

Subject: Truncating UTF-8 Strings
From: Klaus Berkling <email@hidden>
Date: Mon, 8 Jun 2009 14:23:50 -0700

Hi all. This seems like it should work, but it doesn't.

I truncate a string that may contain Japanese characters, purely for display purposes. Double-byte or multi-byte characters are split apart.

Results look like this:

お使いのコンピュータにDVDドライブが搭載れているかは�?

or:

Here is the code:

public String stringWithNoHTML(String aStringWithHTML, int lengthTruncated) {
    String returnValue = null;
    if (aStringWithHTML != null && aStringWithHTML.length() > 0) {
        //StringBuffer textBlock = new StringBuffer(aStringWithHTML);
        StringBuffer textBlock = new StringBuffer();
        Pattern htmlTagPattern = Pattern.compile("<(.|\n|\r)+?>|&[a-zA-Z0-9]+;");
        Matcher lineBreakMatcher = htmlTagPattern.matcher(aStringWithHTML);
        boolean results = lineBreakMatcher.find();
        while (results) {
            lineBreakMatcher.appendReplacement(textBlock, " ");
            results = lineBreakMatcher.find();
        }
        lineBreakMatcher.appendTail(textBlock);
        if (lengthTruncated > 0 && textBlock.length() > SUMMARY_LENGTH) {
            try {
                returnValue = new String(textBlock.toString().getBytes("UTF-8"), 0, lengthTruncated, "UTF-8");
            } catch (UnsupportedEncodingException ex) {
                returnValue = null;
            }
            //returnValue = new String(textBlock.substring(0, lengthTruncated) + "...");
        } else
            returnValue = textBlock.toString();
    }
    return returnValue;
}

The original string may contain single-byte characters as well. I expect the string to be truncated on character boundaries, without chopping off bytes of a character. It works fine with single-byte characters.
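To make that expectation concrete: a Java String is a sequence of UTF-16 code units, so truncating by characters rather than by encoded bytes cannot cut a multi-byte UTF-8 sequence in half. Below is a minimal sketch, assuming the display length is counted in Unicode code points. The class and method names are hypothetical and not part of the code above.

```java
// Hypothetical helper, not from the original post.
class CodePointTruncator {

    // Returns at most maxCodePoints code points of s, never splitting a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (s == null) {
            return null;
        }
        if (maxCodePoints <= 0) {
            return "";
        }
        // Count code points, not UTF-16 chars, so supplementary characters count once.
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // Translate the code point count into a char index that is safe to cut at.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }
}
```

With this, CodePointTruncator.truncate(text, 3) keeps the first three characters of a Japanese string whole, regardless of how many bytes each one occupies in UTF-8.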

Using either

returnValue = new String(textBlock.toString().getBytes("UTF-8"), 0, lengthTruncated, "UTF-8");

or

returnValue = new String(textBlock.substring(0, lengthTruncated) + "...");

makes no difference.
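For comparison, here is how the two calls would be expected to behave in isolation. This is a small sketch with a hypothetical class name, assuming Java 7 or later for StandardCharsets. It does not explain why both variants looked the same in the application. It only shows where a cut at a byte offset can go wrong.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not from the original post.
class ByteCutDemo {

    // Cuts the UTF-8 encoding of s after byteCount bytes, as in the first variant.
    static String cutBytes(String s, int byteCount) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, Math.min(byteCount, utf8.length), StandardCharsets.UTF_8);
    }

    // Cuts s after charCount UTF-16 chars, as in the second variant (without the "...").
    static String cutChars(String s, int charCount) {
        return s.substring(0, Math.min(charCount, s.length()));
    }
}
```

Each hiragana character takes three bytes in UTF-8, so cutBytes("あいう", 4) ends in the middle of the second character and the decoder substitutes U+FFFD for the incomplete sequence, while cutChars("あいう", 2) yields two whole characters.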

I also bypassed the regex pattern and still see the same problem.

Files, components, class, etc. are in UTF-8.

Has anyone seen this before and is there a work-around?
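One possible work-around, if the limit really has to be expressed in UTF-8 bytes rather than characters, is to walk the string by code point and stop before the byte budget would be exceeded. Below is a sketch under that assumption. The class and method names are hypothetical.

```java
// Hypothetical helper, not from the original post.
class Utf8ByteTruncator {

    // Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes,
    // cutting only between whole code points.
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        if (s == null) {
            return null;
        }
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 length of this code point: 1 to 4 bytes.
            // A lone surrogate is treated as 3 bytes, a conservative estimate.
            int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + len > maxBytes) {
                break;
            }
            bytes += len;
            // Advance by 1 or 2 chars so surrogate pairs stay together.
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }
}
```

The result never ends in a partial byte sequence, so encoding it to UTF-8 and decoding it again should not introduce replacement characters.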

Thanks

kib

"Success is not final, failure is not fatal: it is the courage to continue that counts."

Winston Churchill

Klaus Berkling

Systems Administrator

DynEd International, Inc.

www.dyned.com | www.eskimo.com/~kiberkli


Follow-Ups:
- Re: Truncating UTF-8 Strings (Resolved)
  - From: Klaus Berkling <email@hidden>
- Re: Truncating UTF-8 Strings
  - From: Andrew Lindesay <email@hidden>
