Re: Sorting characters of the text - script doesn't work as expected

Subject: Re: Sorting characters of the text - script doesn't work as expected
From: "Nigel Garvey" <email@hidden>
Date: Sun, 28 May 2017 15:35:49 +0100

Yvan KOENIG wrote on Fri, 26 May 2017 19:35:17 +0200:

>what is the true reason why the script freeze when it is asked to execute :
>set MyDoc to text of document i
>on my machine.

On my own machine, the freeze occurs in the following line, when the
script tries to count the words. I think this may be connected with the
extraneous codes you mention:

>Minor complementary problem :
>how were the infamous extraneous couples "FC FF" inserted in the original files ?

I've had a go at writing my own version of the script which doesn't use
TextEdit, uses regex to identify the words and check for Russian and
Lithuanian characters, and only generates the data it actually needs.
The regex isn't affected by the extraneous characters, so there's no
freeze and the problem characters aren't identified as words. The word
count's thus more accurate in this respect, but I don't know how similar
my "word" implementation is to that on Russian systems!
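To get a feel for what the word-finding pattern in the script below treats as a word, here's a minimal sketch that runs the same pattern over a made-up sentence and collects the matches. The sample sentence and the variable names are assumptions for illustration only; the pattern itself is copied from the script.

```applescript
use AppleScript version "2.4" -- Mac OS 10.10 (Yosemite) or later
use framework "Foundation"

-- Same pattern as in the main script. The sample sentence is invented for this sketch.
set wordPattern to "(?:(?:\\w|(?<=\\w)[.'](?=\\w)|(?<=\\d),(?=\\d))++)|[£$€¢©®™]"
set sample to current application's NSString's stringWithString:("It's 1,000.50 roubles, isn't it? £5")
set wordsRegex to current application's NSRegularExpression's regularExpressionWithPattern:(wordPattern) options:(current application's NSRegularExpressionUseUnicodeWordBoundaries) |error|:(missing value)
set wordMatches to wordsRegex's matchesInString:(sample) options:(0) range:({0, sample's |length|()})

-- Extract the text of each match so that the "words" can be inspected.
set foundWords to {}
repeat with i from 0 to ((wordMatches's |count|()) - 1)
	set matchRange to (wordMatches's objectAtIndex:(i))'s range()
	set end of foundWords to (sample's substringWithRange:(matchRange)) as text
end repeat
foundWords
```

The pattern is meant to keep "It's" and "1,000.50" together as single words, while the comma after "roubles" and the question mark aren't matched at all.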


  use AppleScript version "2.4" -- Mac OS 10.10 (Yosemite) or later
  use framework "Foundation"
  use scripting additions

  main()

  on main() -- All the action in an ordinary handler to keep the variables local and non-persistent.
    set folderPath to (path to downloads folder as text) & "For ASC forums:"
    tell application "Finder" to set theFiles to items of folder folderPath as alias list -- theFiles is a list of alias(es) to one or more txt files.

    set |?| to current application
    -- A basic word-finding regex: finds either a run of word characters (but allowing single instances of "." or "'" between word characters or "," between digits) or one of a small collection of currency or copyright symbols. Adjust as/if necessary.
    set wordsNSRegex to |?|'s class "NSRegularExpression"'s regularExpressionWithPattern:("(?:(?:\\w|(?<=\\w)[.'](?=\\w)|(?<=\\d),(?=\\d))++)|[£$€¢©®™]") options:(|?|'s NSRegularExpressionUseUnicodeWordBoundaries) |error|:(missing value)
    set RussianCharacterRegex to |?|'s class "NSString"'s stringWithString:("[:script=cyrillic:]") -- Regex to find any Russian character.
    set LithuanianCharacterRegex to |?|'s class "NSString"'s stringWithString:("(?i)[ąčęšėįųūž]") -- Regex to find any Lithuanian character.

    -- Initialise variables for the word counts.
    set EnWordsCount to 0
    set LtWordsCount to 0
    set RuWordsCount to 0
    -- Go through the files in turn.
    repeat with thisFile in theFiles
      -- Read the text directly from the file, letting the system guess the text encoding.
      set fileURL to (|?|'s class "NSURL"'s fileURLWithPath:(POSIX path of thisFile))
      set MyDoc to (|?|'s class "NSString"'s stringWithContentsOfURL:(fileURL) usedEncoding:(missing value) |error|:(missing value))
      set docRange to {0, MyDoc's |length|()}
      -- Match and count the words (as recognised by my regex) in the document text.
      set wordMatches to (wordsNSRegex's matchesInString:(MyDoc) options:(0) range:(docRange))
      set WordsCount to (wordMatches's |count|())

      -- Test a random word in each page (page = 230 words) to see if it contains a letter from the Russian alphabet. (More precisely, see if the length of any Russian character in the range occupied by the word is greater than 0!)
      set CountTrue to 0
      set CountFalse to 0
      repeat with j from 1 to (WordsCount - 229) by 230 -- If there are fewer than 230 words in the last page, they're ignored here.
        set rand to j + (random number 229)
        set randomWordRange to (item rand of wordMatches)'s range()
        if ((MyDoc's rangeOfString:(RussianCharacterRegex) options:(|?|'s NSRegularExpressionSearch) range:(randomWordRange))'s |length| > 0) then
          set CountTrue to CountTrue + 1
        else
          set CountFalse to CountFalse + 1
        end if
      end repeat

      -- Update the appropriate word count according to whether more Russian characters were found than not, any Lithuanian characters were found, or none of these were found.
      if (CountTrue > CountFalse) then -- Comparing the "true" and "false" counts decides whether the text is Russian (Cyrillic).
        set RuWordsCount to RuWordsCount + WordsCount
      else if ((MyDoc's rangeOfString:(LithuanianCharacterRegex) options:(|?|'s NSRegularExpressionSearch) range:(docRange))'s |length| > 0) then -- Since the Lithuanian and English alphabets both stem from the Latin alphabet, we only need to check whether the text contains Lithuanian letters.
        set LtWordsCount to LtWordsCount + WordsCount
      else
        set EnWordsCount to EnWordsCount + WordsCount
      end if
    end repeat

    # Having totalled the words for each language, we now calculate the price from the word count across all documents written in the same language (that is, separate documents in the same language are treated as a single document).
    set NotificationMessageEn to getNotificationMessage("English", EnWordsCount)
    set NotificationMessageLt to getNotificationMessage("Lithuanian", LtWordsCount)
    set NotificationMessageRu to getNotificationMessage("Russian", RuWordsCount)

    {NotificationMessageEn, NotificationMessageLt, NotificationMessageRu}
  end main

  on getNotificationMessage(language, wordCount)
    set pageCount to (wordCount / 230) as integer
    if (pageCount ≤ 20) then
      set docPrice to pageCount * 3
    else
      set docPrice to pageCount * 2
    end if
    if (wordCount > 0) then
      return ("Language: " & language & linefeed) & ("Words count: " & wordCount & linefeed) & ("Pages count: " & pageCount & linefeed) & ("Price (Eu): " & docPrice & linefeed & linefeed)
    else
      return ""
    end if
  end getNotificationMessage
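
To make the page and price arithmetic in getNotificationMessage concrete, here's a small sketch that applies the same calculation to a few word counts. The handler name pagesAndPrice and the sample counts are hypothetical, not part of the script above.

```applescript
-- Hypothetical helper: the same arithmetic as getNotificationMessage, returned as a record.
on pagesAndPrice(wordCount)
	set pageCount to (wordCount / 230) as integer -- Rounds to the nearest whole page.
	if (pageCount ≤ 20) then
		set docPrice to pageCount * 3
	else
		set docPrice to pageCount * 2
	end if
	return {pageCount:pageCount, docPrice:docPrice}
end pagesAndPrice

-- Sample word counts chosen for illustration.
set results to {}
repeat with n in {100, 1000, 4600, 4830}
	set end of results to pagesAndPrice(contents of n)
end repeat
results
--> {{pageCount:0, docPrice:0}, {pageCount:4, docPrice:12}, {pageCount:20, docPrice:60}, {pageCount:21, docPrice:42}}
```

Because the page count is rounded to the nearest integer, a total of fewer than 115 words comes out as 0 pages and a price of 0, and the lower per-page rate applies to every page once the count passes 20.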


NG