Bad Characters from Unicode
- Subject: Bad Characters from Unicode
- From: Luther Fuller <email@hidden>
- Date: Sat, 29 Sep 2007 20:22:24 -0500
[was "Bad Character"]
Yesterday, I found a scripting addition called TextCommands that has
the commands 'unicode number' and 'unicode character' that allow me
to construct unicode test strings. Using these unicode strings, I
have been able to test some scripts and now I think I see what the
problem is.
I searched the web for unicode tables and easily found a couple of
good ones. In Unicode itself, characters numbered 128 thru 159 are
"control" characters, but the older 8-bit character sets assign
printable characters to most of the same numbers. For example,
character 138 = [ANSI - S caron; MacRoman - a diaeresis]. Characters
128, 129 and 130 don't seem to cause a problem; however, characters
131 thru 159 are all very "bad characters" when converted to ascii
text. (I haven't done any testing on unicode text, only on ascii text
converted from unicode.)
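As a cross-check on the tables, here is a small sketch in Python (not AppleScript, and not part of my scripts) that asks Python's own character database about the range 128 thru 159 and decodes byte 138 under two 8-bit encodings. It assumes that "ANSI" means Windows-1252, which is my reading of the tables rather than something they state.

```python
import unicodedata

# Unicode code points 128..159 (U+0080..U+009F): check that every one
# has the general category "Cc", i.e. control character.
c1_range = range(128, 160)
all_controls = all(unicodedata.category(chr(n)) == "Cc" for n in c1_range)

# The same number, read as a single byte in two legacy 8-bit encodings.
byte_138 = bytes([138])
as_ansi = byte_138.decode("cp1252")         # assumption: "ANSI" = Windows-1252
as_macroman = byte_138.decode("mac_roman")  # MacRoman

print(all_controls)
print(unicodedata.name(as_ansi))
print(unicodedata.name(as_macroman))
```

The point is only that the number 138 has three different meanings depending on which table a program consults, which would explain why applications disagree about what to display.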
WHAT'S THE PROBLEM -- Why are these characters bad?
1. When using text item delimiters set to {space & space}, double
spaces to the right of one of these bad characters are not
recognized. (This is what broke my code.)
2. Displaying these characters gives inconsistent results from one
application to the next. Mail correctly displays char 138 as
[S caron]. Some other applications interpret it as [MacRoman - a
diaeresis], and others don't display anything at all.
3. I don't trust them - there may be other as yet undiscovered
problems with them.
WHAT'S THE SOLUTION?
I replaced code that used text item delimiters with code that used
other methods to do the same thing and was pleased with the results.
However, the bad characters are not removed and I would like to keep
the old code using text item delimiters. Since I don't trust these
bad characters, I think it best to simply remove them.
ANOTHER SOLUTION
I wrote code to remove the bad characters from an ascii text string.
This was followed by the old code using text item delimiters = {space
& space} ... and it didn't work! It seems that the part of the ascii
string to the right of the bad characters remains poisoned even after
removing the bad characters. (This is a hint that perhaps the
conversion from unicode to ascii is putting null characters into the
string. Maybe?)
A SOLUTION THAT WORKS
The solution that works is to remove the bad characters from the
unicode string before converting to ascii. Here's the code ...
set charList to (characters of uniText) as list
repeat with i from 1 to (count items of charList)
    -- look up the character's number once and keep it in a variable
    set charNum to ASCII number ((item i of charList) as text)
    if (130 < charNum) and (charNum < 160) then
        set item i of charList to "" -- blank out the bad character
    end if
end repeat
set AppleScript's text item delimiters to {} -- very necessary
set asciiText to (charList as text)
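For comparison, the same filter can be sketched in Python (again not AppleScript; the function name and the test string are made up for illustration). It drops every character numbered 131 thru 159 from the unicode string first, then splits on double spaces, which is roughly what text item delimiters = {space & space} does.

```python
def strip_bad_characters(uni_text: str, low: int = 131, high: int = 159) -> str:
    """Drop every character whose code point lies in low..high (inclusive)."""
    return "".join(ch for ch in uni_text if not (low <= ord(ch) <= high))


# Hypothetical test string: character 138 followed by double-space separators.
sample = "caf\u008a  one  two"

cleaned = strip_bad_characters(sample)

# Rough analogue of splitting with text item delimiters = {space & space}.
parts = cleaned.split("  ")

print(repr(cleaned))
print(parts)
```

This only illustrates the order of operations (filter first, convert or split afterwards). It says nothing about why AppleScript's converted string misbehaves.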
It would be nice if the 'unicode number' and 'unicode character'
commands were part of StandardAdditions. I dislike having to use
workarounds! It's very odd that AppleScript has unicode text but not
the commands to manipulate it.
It would be even better if these bad characters were translated
properly!