Re: collectdata


  • Subject: Re: collectdata
  • From: Julien Battist <email@hidden>
  • Date: Tue, 07 Mar 2017 10:02:21 +0100
  • Importance: normal
  • Sensitivity: Normal

---- in AppleScript, that is... since I don't know any other language :)
 
Sent: Tuesday, March 07, 2017 at 9:58 AM
From: "Julien Battist" <email@hidden>
To: "Thomas Fischer" <email@hidden>
Cc: "Applescript Users List" <email@hidden>
Subject: Re: collectdata
Hi Thomas and all,
 
Very interesting how this thread is going :) and I have learned from it... for later purposes.
As Shane mentioned, page number extraction is also one of my targets, but I didn't want to go into too much detail, to avoid drifting away from the main subject.
 
Target:
--- find a regular expression (Alphanumeric...) ---> push it to a flat file
--- mark hit clearly on the pdf.
 
It seems not possible, is that the conclusion?
 
Many thanks all, Julien
 
 
 
 
 
 
Sent: Tuesday, March 07, 2017 at 8:15 AM
From: "Thomas Fischer" <email@hidden>
To: "Applescript Users List" <email@hidden>
Subject: Re: collectdata
Hello,

some years ago I spent some time evaluating the different options for extracting the text from PDF files. The results varied, depending on how the PDF file was created, be it with Acrobat (Distiller), MS Word, pdflatex,…
None of the tested methods really preserved the formatting, but the main problem was the handling of non-ASCII characters, UTF-8 etc. The best method I found in this respect was pdfbox, an open source Apache project. If that doesn’t matter (as in this case, where Julien is looking for numbers), any method will do, e.g. getting it from Preview or Skim, pdftotext or ASObjC.
It would be fairly easy to find all occurrences in the text (e.g. by using BBEdit or TextWrangler) and collect the matched values, so that these can then be searched for without using grep.
The problem is to find and *color* the respective hits in the pdf file.
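
The text-search half of this can be sketched in a few lines. What follows is a minimal Python sketch (Python rather than AppleScript, purely for illustration), not a tested workflow. It assumes the PDF text has already been extracted to a string by one of the methods above. The function names, the pattern `[A-Z]{2}\d{4}` and the sample text are placeholders, not anything taken from Julien's data.

```python
import csv
import re


def find_hits(text, pattern):
    """Return (start_index, matched_text) for every match of pattern in text."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]


def write_hits(hits, path):
    """Write the hits to a tab-separated flat file, one hit per line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["index", "match"])
        writer.writerows(hits)


if __name__ == "__main__":
    # Hypothetical sample text standing in for the extracted PDF text.
    sample = "Order AB1234 shipped, order CD5678 pending."
    for index, value in find_hits(sample, r"[A-Z]{2}\d{4}"):
        print(index, value)
```

Calling `write_hits(hits, "hits.tsv")` with the list returned by `find_hits` would produce the flat file; the file name is again only a placeholder. This covers the "find and push to a flat file" step only, not the marking of hits in the PDF.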

> On 07.03.2017 at 00:05, Shane Stanley <email@hidden> wrote:
>
> That's true, but could be a good thing in this case. If you extract it this way page-by-page, you can find the point on the page where the text is by getting its index in the returned string, and then calling -characterBoundsAtIndex:. With that information, it should in theory be possible to use ASObjC to add a highlight at that point on the page.
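
The page-by-page lookup quoted above can be sketched independently of PDFKit. This is a minimal Python sketch under stated assumptions: the per-page strings are taken as already extracted (for example, one string per page from ASObjC), and `locate_hits` and the sample pages are hypothetical. The index it reports is the offset into that page's own string, which is the kind of value -characterBoundsAtIndex: takes. The highlighting step itself is not shown.

```python
import re


def locate_hits(pages, pattern):
    """Yield (page_number, char_index, matched_text) for each match.

    pages is a list of strings, one per PDF page, in page order.
    page_number is 1-based; char_index is 0-based within that page's string.
    """
    regex = re.compile(pattern)
    for page_number, text in enumerate(pages, start=1):
        for match in regex.finditer(text):
            yield page_number, match.start(), match.group(0)


if __name__ == "__main__":
    # Hypothetical page texts standing in for real extraction results.
    pages = ["Invoice 10045 issued", "no numbers here", "Refs 20391 and 77120"]
    for page_number, index, value in locate_hits(pages, r"\d{5}"):
        print(page_number, index, value)
```

One caveat: Python counts code points while NSString counts UTF-16 units, so for characters outside the Basic Multilingual Plane the two offsets can differ. For plain numbers and ASCII text they coincide.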

This would be interesting, but I can’t see how you get correct character bounds with mangled text. And is there an ASObjC method to edit or annotate PDF files? As I mentioned, Skim might work, but I didn’t get around to testing that.

All the best
Thomas


> On 06.03.2017 at 21:45, Christopher Stone <email@hidden> wrote:
>
> On Mar 06, 2017, at 06:41, Yvan KOENIG <email@hidden> wrote:
>> No need for a third party tool to extract text from a PDF. This script delivered by Shane STANLEY does the job.
>
> That is indeed useful – IF you only need the RAW output.
>
> The ASObjC code returns RAW output and the layout of the pdf is significantly mangled.
>
> It's the same as pdftotext's -raw output.
>
> pdftotext -raw "/path/to/your/File.pdf" -
>
> The primary reason to use pdftotext instead of other tools is its ability to preserve the fidelity of the PDF file's layout. (It's the only tool I personally know of that does this.)
>
> It's not perfect, but it can make parsing a PDF document's text relatively easy instead of difficult to impossible.
>
> pdftotext -layout "/path/to/your/File.pdf" -
>
> --
> Best Regards,
> Chris
>
> _______________________________________________
> Do not post admin requests to the list. They will be ignored.
> AppleScript-Users mailing list (email@hidden)
> Help/Unsubscribe/Update your Subscription:
> Archives: http://lists.apple.com/archives/applescript-users
>
> This email sent to email@hidden
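
The two pdftotext invocations quoted above can also be driven from a script. This is a minimal Python sketch, assuming pdftotext is installed and on the PATH. It also assumes pdftotext's default behaviour of ending each page with a form feed character, which is what makes page numbers recoverable. The helper names are hypothetical.

```python
import subprocess


def pdftotext_command(pdf_path, mode="layout"):
    """Build the pdftotext argument list; the final "-" sends text to stdout."""
    if mode not in ("layout", "raw"):
        raise ValueError("mode must be 'layout' or 'raw'")
    return ["pdftotext", "-" + mode, pdf_path, "-"]


def split_pages(output):
    """Split pdftotext output into pages on the form feed that ends each page."""
    pages = output.split("\f")
    # A trailing form feed leaves one empty item at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def extract_pages(pdf_path, mode="layout"):
    """Run pdftotext (assumed to be on PATH) and return a list of page strings."""
    result = subprocess.run(
        pdftotext_command(pdf_path, mode),
        capture_output=True, encoding="utf-8", check=True,
    )
    return split_pages(result.stdout)
```

With `extract_pages("/path/to/your/File.pdf")`, the position of a string in the returned list plus one gives its page number.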


