Re: Sorting characters of the text - script doesn't work as expected
- Subject: Re: Sorting characters of the text - script doesn't work as expected
- From: "Nigel Garvey" <email@hidden>
- Date: Sun, 28 May 2017 15:35:49 +0100
Yvan KOENIG wrote on Fri, 26 May 2017 19:35:17 +0200:
>what is the true reason why the script freeze when it is asked to
>execute :
>set MyDoc to text of document i
>on my machine.
On my own machine, the freeze occurs in the following line, when the
script tries to count the words. I think this may be connected with the
extraneous codes you mention:
>Minor complementary problem :
>how were the infamous extraneous couples "FC FF" inserted in the
>original
>files ?
I've had a go at writing my own version of the script which doesn't use
TextEdit, uses regex to identify the words and check for Russian and
Lithuanian characters, and only generates the data it actually needs.
The regex isn't affected by the extraneous characters, so there's no
freeze and the problem characters aren't identified as words. The word
count's thus more accurate in this respect, but I don't know how similar
my "word" implementation is to that on Russian systems!
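As a rough illustration of what the word regex picks up, here's a minimal sketch that runs the same pattern over a made-up sample string. The sample text is hypothetical, and using U+FFFC (character id 65532) as a stand-in for the "FC FF" pairs is only my assumption about what those bytes represent. The matches should come out as "Hello", "world's", "1,234.5" and "£", with the stray character left out.

```applescript
use AppleScript version "2.4" -- Mac OS 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample text. U+FFFC is an ASSUMED stand-in for the extraneous "FC FF" pairs.
set sampleText to "Hello" & (character id 65532) & " world's 1,234.5 £"
set sampleString to current application's class "NSString"'s stringWithString:(sampleText)

-- The same word-finding pattern as in the script below.
set wordsNSRegex to current application's class "NSRegularExpression"'s regularExpressionWithPattern:("(?:(?:\\w|(?<=\\w)[.'](?=\\w)|(?<=\\d),(?=\\d))++)|[£$€¢©®™]") options:(current application's NSRegularExpressionUseUnicodeWordBoundaries) |error|:(missing value)
set wordMatches to wordsNSRegex's matchesInString:(sampleString) options:(0) range:({0, sampleString's |length|()})

-- Collect the matched substrings as AppleScript text for inspection.
set foundWords to {}
repeat with i from 1 to (wordMatches's |count|())
	set thisRange to (wordMatches's objectAtIndex:(i - 1))'s range()
	set end of foundWords to ((sampleString's substringWithRange:(thisRange)) as text)
end repeat
foundWords
```

Returning the substrings rather than just the count makes it easier to adjust the pattern if it turns out to be too generous or too strict for a particular language.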
use AppleScript version "2.4" -- Mac OS 10.10 (Yosemite) or later
use framework "Foundation"
use scripting additions
main()
on main() -- All the action in an ordinary handler to keep the variables local and non-persistent.
set folderPath to (path to downloads folder as text) & "For ASC forums:"
tell application "Finder" to set theFiles to items of folder folderPath as alias list -- theFiles is a list of alias(es) to one or more txt files.
set |?| to current application
-- A basic word-finding regex: finds either a run of word characters (but allowing single instances of "." or "'" between word characters or "," between digits) or one of a small collection of currency or copyright symbols. Adjust as/if necessary.
set wordsNSRegex to |?|'s class "NSRegularExpression"'s regularExpressionWithPattern:("(?:(?:\\w|(?<=\\w)[.'](?=\\w)|(?<=\\d),(?=\\d))++)|[£$€¢©®™]") options:(|?|'s NSRegularExpressionUseUnicodeWordBoundaries) |error|:(missing value)
set RussianCharacterRegex to |?|'s class "NSString"'s stringWithString:("[:script=cyrillic:]") -- Regex to find any Cyrillic character (used here as the test for Russian).
set LithuanianCharacterRegex to |?|'s class "NSString"'s stringWithString:("(?i)[ąčęšėįųūž]") -- Regex to find any Lithuanian character.
-- Initialise variables for the word counts.
set EnWordsCount to 0
set LtWordsCount to 0
set RuWordsCount to 0
-- Go through the files in turn.
repeat with thisFile in theFiles
-- Read the text directly from the file, letting the system guess the text encoding.
set fileURL to (|?|'s class "NSURL"'s fileURLWithPath:(POSIX path of thisFile))
set MyDoc to (|?|'s class "NSString"'s stringWithContentsOfURL:(fileURL) usedEncoding:(missing value) |error|:(missing value))
set docRange to {0, MyDoc's |length|()}
-- Match and count the words (as recognised by my regex) in the document text.
set wordMatches to (wordsNSRegex's matchesInString:(MyDoc) options:(0) range:(docRange))
set WordsCount to (wordMatches's |count|())
-- Test a random word in each page (page = 230 words) to see if it contains a letter from the Russian alphabet. (More precisely, see if the length of the range of any Russian character found within the range occupied by the word is greater than 0!)
set CountTrue to 0
set CountFalse to 0
repeat with j from 1 to (WordsCount - 229) by 230 -- If there are fewer than 230 words in the last page, they're ignored here.
set rand to j + (random number 229)
set randomWordRange to (item rand of wordMatches)'s range()
if ((MyDoc's rangeOfString:(RussianCharacterRegex) options:(|?|'s NSRegularExpressionSearch) range:(randomWordRange))'s |length| > 0) then
set CountTrue to CountTrue + 1
else
set CountFalse to CountFalse + 1
end if
end repeat
-- Update the appropriate word count according to whether more of the sampled words contained Russian characters than not, any Lithuanian characters were found, or none of these were found.
if (CountTrue > CountFalse) then -- Comparing the "true" and "false" counts decides whether the text is treated as Russian (Cyrillic).
set RuWordsCount to RuWordsCount + WordsCount
else if ((MyDoc's rangeOfString:(LithuanianCharacterRegex) options:(|?|'s NSRegularExpressionSearch) range:(docRange))'s |length| > 0) then -- Since the Lithuanian and English alphabets both stem from the Latin alphabet, we need only check whether the text contains Lithuanian letters.
set LtWordsCount to LtWordsCount + WordsCount
else
set EnWordsCount to EnWordsCount + WordsCount
end if
end repeat
-- Build the notification text for each language. The price is calculated from the combined word count of all the documents written in the same language (that is, separate documents in the same language are treated as a single document).
set NotificationMessageEn to getNotificationMessage("English", EnWordsCount)
set NotificationMessageLt to getNotificationMessage("Lithuanian", LtWordsCount)
set NotificationMessageRu to getNotificationMessage("Russian", RuWordsCount)
{NotificationMessageEn, NotificationMessageLt, NotificationMessageRu}
end main
on getNotificationMessage(language, wordCount)
set pageCount to (wordCount / 230) as integer
if (pageCount ≤ 20) then
set docPrice to pageCount * 3
else
set docPrice to pageCount * 2
end if
if (wordCount > 0) then
return ("Language: " & language & linefeed) & ("Words count: " & wordCount & linefeed) & ("Pages count: " & pageCount & linefeed) & ("Price (Eu): " & docPrice & linefeed & linefeed)
else
return ""
end if
end getNotificationMessage
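To make the page and price arithmetic in getNotificationMessage easier to check, here's a small sketch with the same calculation pulled out into a handler of its own. The handler name pagesAndPrice and the two word counts are hypothetical, chosen only for illustration; the 230 words per page and the rates of 3 and 2 per page are taken from the script above.

```applescript
-- Hypothetical helper mirroring the page and price arithmetic in getNotificationMessage.
on pagesAndPrice(wordCount)
	set pageCount to (wordCount / 230) as integer -- The coercion rounds to a whole number of pages.
	if (pageCount ≤ 20) then
		return {pageCount, pageCount * 3} -- Up to 20 pages: 3 per page.
	else
		return {pageCount, pageCount * 2} -- More than 20 pages: 2 per page.
	end if
end pagesAndPrice

{pagesAndPrice(4600), pagesAndPrice(4830)}
--> {{20, 60}, {21, 42}}
```

Note that with these rates a 21-page job comes out cheaper than a 20-page one, which may or may not be what's intended.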
NG