Bad Characters from Unicode

Subject: Bad Characters from Unicode
From: Luther Fuller <email@hidden>
Date: Sat, 29 Sep 2007 20:22:24 -0500

[was "Bad Character"]

Yesterday, I found a scripting addition called TextCommands whose 'unicode number' and 'unicode character' commands let me construct Unicode test strings. Using these strings, I have been able to test some scripts, and I now think I see what the problem is.

I searched the web for Unicode tables and easily found a couple of good ones. Unicode characters numbered 128 through 159 are "control" characters (the C1 controls, U+0080 to U+009F), but some of these numbers are assigned printable characters in older 8-bit encodings. For example, character 138 = [ANSI - S caron; MacRoman - a diaeresis]. Characters 128, 129 and 130 don't seem to cause a problem; characters 131 through 159, however, are all very "bad characters" when converted to ASCII text. (I haven't done any testing on Unicode text, only on ASCII text converted from Unicode.)
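To make the numbering concrete, here is a minimal sketch showing that the printable characters mentioned above have Unicode code points of their own, separate from 138. It assumes AppleScript 2.0 (Mac OS X 10.5) or later, where text has a built-in `id` property and `character id` builds a character from a code point. The variable names are made up for illustration, and none of this comes from TextCommands.

```applescript
-- Assumes AppleScript 2.0 or later (built-in `id` and `character id`).
set sCaron to character id 352 -- U+0160, LATIN CAPITAL LETTER S WITH CARON
set aDiaeresis to character id 228 -- U+00E4, LATIN SMALL LETTER A WITH DIAERESIS
set c1Control to character id 138 -- U+008A, a C1 control character with no glyph

-- Read the code points back from the text values
set codePoints to {id of sCaron, id of aDiaeresis, id of c1Control}
codePoints -- {352, 228, 138}
```

In other words, 138 only means "S caron" or "a diaeresis" when it is read as a byte in one of the older 8-bit encodings. As a Unicode code point it is a control character.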

WHAT'S THE PROBLEM -- Why are these characters bad?

1. When text item delimiters are set to {space & space}, double spaces to the right of one of these bad characters are not recognized. (This is what broke my code.)
2. With one exception, displaying these characters gives inconsistent results from one application to the next. The exception is Mail, which correctly displays char 138 as [S caron]. Some other applications interpret it as [MacRoman - a diaeresis], and others don't display anything at all.
3. I don't trust them - there may be other, as yet undiscovered, problems with them.
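For reference, this is roughly what splitting on a double space with text item delimiters looks like when the text is clean. It is only a sketch: the sample string and the variable names are hypothetical, and it saves and restores the previous delimiters so that later code is not affected.

```applescript
-- Hypothetical sample: three fields separated by double spaces
set sampleText to "alpha  beta  gamma"

-- Save the current delimiters, split on a double space, then restore them
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to {space & space}
set fieldList to text items of sampleText
set AppleScript's text item delimiters to savedDelimiters

fieldList -- {"alpha", "beta", "gamma"}
```

The failure described in item 1 is that, with a bad character earlier in the string, the later double spaces no longer act as delimiters, so fewer items come back than expected.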

WHAT'S THE SOLUTION? I replaced code that used text item delimiters with code that used other methods to do the same thing and was pleased with the results. However, the bad characters are not removed and I would like to keep the old code using text item delimiters. Since I don't trust these bad characters, I think it best to simply remove them.

ANOTHER SOLUTION I wrote code to remove the bad characters from an ASCII text string. This was followed by the old code using text item delimiters = {space & space} ... and it didn't work! It seems that the part of the ASCII string to the right of the bad characters remains poisoned even after the bad characters are removed. (This is a hint that perhaps the conversion from Unicode to ASCII is putting null characters into the string. Maybe?)

A SOLUTION THAT WORKS The solution that works is to remove the bad characters from the Unicode string before converting it to ASCII. Here's the code ...

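-- uniText holds the Unicode text to clean (assumed to be set earlier)
set charList to (characters of uniText) as list
repeat with i from 1 to (count items of charList)
	set charCode to ASCII number ((item i of charList) as text)
	-- blank out characters 131 through 159, the "bad" range
	if (130 < charCode) and (charCode < 160) then
		set item i of charList to ""
	end if
end repeat
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to {} -- very necessary
set asciiText to (charList as text)
set AppleScript's text item delimiters to savedDelimiters -- restore for later code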
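On AppleScript 2.0 (Mac OS X 10.5) or later, the same filtering can be sketched on code points directly, without ASCII number or text item delimiters. This is an alternative under that assumption, not the code above: the handler name `removeCodePointRange` and the test string are hypothetical, and `id of` and `string id` are the built-in forms it relies on.

```applescript
-- Hypothetical handler: return sourceText without any code points
-- from lowCode through highCode. Assumes AppleScript 2.0 or later.
on removeCodePointRange(sourceText, lowCode, highCode)
	if (length of sourceText) is 0 then return ""
	set codeList to id of sourceText
	-- `id of` returns a bare integer for one-character text
	if class of codeList is not list then set codeList to {codeList}
	set keptCodes to {}
	repeat with i from 1 to (count codeList)
		set codePoint to item i of codeList
		if (codePoint < lowCode) or (codePoint > highCode) then
			set end of keptCodes to codePoint
		end if
	end repeat
	if keptCodes is {} then return ""
	return string id keptCodes
end removeCodePointRange

-- Usage: strip code points 131 through 159 from a made-up test string
set uniText to "one" & (character id 138) & "  two"
set cleanText to removeCodePointRange(uniText, 131, 159)
```

Passing the range as parameters keeps the choice of which characters count as "bad" out of the handler, so 128 through 130 can be included later if they turn out to cause trouble too.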

It would be nice if the 'unicode number' and 'unicode character' commands were part of StandardAdditions. I dislike having to use workarounds! It's very odd that AppleScript has Unicode text but not the commands to manipulate it.

It would be even better if these bad characters were translated properly!

_______________________________________________
AppleScript-Users mailing list      (email@hidden)
Archives: http://lists.apple.com/archives/applescript-users
