Re: collectdata
- Subject: Re: collectdata
- From: Thomas Fischer <email@hidden>
- Date: Tue, 07 Mar 2017 08:15:21 +0100
Hello,
Some years ago I spent some time evaluating the different options for extracting text from PDF files. The results varied depending on how the PDF file had been created, be it with Acrobat (Distiller), MS Word, pdflatex, …
None of the tested methods really preserved the formatting, but the main problem was the handling of non-ASCII characters, UTF-8 etc. The best method I found in this respect was PDFBox, an open-source Apache project. If non-ASCII handling doesn’t matter (as in this case, where Julien is looking for numbers), any method will do, e.g. getting the text from Preview or Skim, pdftotext or ASObjC.
It would be fairly easy to find all occurrences in the extracted text (e.g. by using BBEdit or TextWrangler) and collect the found values, so that each value can then be searched for without using grep.
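As a rough illustration of that step, here is a minimal Python sketch that collects the numbers and their character offsets from text that has already been extracted. The pattern and the sample string are my own assumptions for illustration only; the real documents may well need a different number format.

```python
import re

# Assumed number format: optional sign, digits, optional decimal part
# with either a dot or a comma as separator (e.g. 42, -3.5, 1,25).
NUMBER = re.compile(r"[-+]?\d+(?:[.,]\d+)?")

def find_numbers(text):
    """Return a list of (start_offset, matched_string) for every number in text."""
    return [(m.start(), m.group()) for m in NUMBER.finditer(text)]

# Hypothetical sample standing in for text extracted from a PDF.
sample = "Total: 42 items, weight 3.5 kg"
for offset, value in find_numbers(sample):
    print(offset, value)
```

The offsets refer to the extracted string, not to positions in the PDF itself, which is exactly why the next step is the hard one.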
The problem is to find and *color* the respective hits in the pdf file.
> On 07.03.2017 at 00:05, Shane Stanley <email@hidden> wrote:
>
> That's true, but could be a good thing in this case. If you extract it this way page-by-page, you can find the point on the page where the text is by getting its index in the returned string, and then calling -characterBoundsAtIndex:. With that information, it should in theory be possible to use ASObjC to add a highlight at that point on the page.
This would be interesting, but I can’t see how you get correct character bounds with mangled text. And is there an ASObjC method to edit or annotate PDF files? As I mentioned, Skim might work, but I didn’t get around to testing that.
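To make the page-by-page idea concrete, here is a small Python sketch of the lookup part only. It assumes the text of each page is already available as one string per page (the `pages` list below is hypothetical sample data), and it returns the page number and character index of each hit. That index is the value one would hand to -characterBoundsAtIndex: on the corresponding page, assuming the string came from that same page object. The highlighting itself is not covered here.

```python
def locate_hits(pages, needle):
    """Return a list of (page_number, char_index) for each occurrence of needle.

    pages: list of strings, one per PDF page, in page order.
    page_number is 1-based; char_index is 0-based within that page's string.
    """
    hits = []
    if not needle:
        # An empty search string would match everywhere; treat it as no hits.
        return hits
    for page_number, page_text in enumerate(pages, start=1):
        index = page_text.find(needle)
        while index != -1:
            hits.append((page_number, index))
            # Continue searching after the end of the current match.
            index = page_text.find(needle, index + len(needle))
    return hits

# Hypothetical per-page strings standing in for extracted page text.
pages = ["Value 12.5 here", "none", "12.5 and 12.5"]
for page_number, index in locate_hits(pages, "12.5"):
    print(page_number, index)
```

Whether the bounds obtained for such an index are reliable when the extracted text is mangled remains the open question.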
All the best
Thomas
> On 06.03.2017 at 21:45, Christopher Stone <email@hidden> wrote:
>
> On Mar 06, 2017, at 06:41, Yvan KOENIG <email@hidden> wrote:
>> No need for a third-party tool to extract text from a PDF. This script delivered by Shane STANLEY does the job.
>
> That is indeed useful – IF you only need the RAW output.
>
> The ASObjC code returns RAW output and the layout of the pdf is significantly mangled.
>
> It's the same as pdftotext's -raw output.
>
> pdftotext -raw "/path/to/your/File.pdf" -
>
> The primary reason to use pdftotext instead of other tools is its ability to preserve the fidelity of the PDF file's layout. (It's the only tool I personally know of that does this.)
>
> It's not perfect, but it can make parsing a PDF document's text relatively easy instead of difficult to impossible.
>
> pdftotext -layout "/path/to/your/File.pdf" -
>
> --
> Best Regards,
> Chris
>
> _______________________________________________
> Do not post admin requests to the list. They will be ignored.
> AppleScript-Users mailing list (email@hidden)
> Help/Unsubscribe/Update your Subscription:
> Archives: http://lists.apple.com/archives/applescript-users
>
> This email sent to email@hidden