Re: collectdata

Subject: Re: collectdata
From: Thomas Fischer <email@hidden>
Date: Tue, 07 Mar 2017 08:15:21 +0100

Hello,

some years ago I spent some time evaluating the different options for extracting the text from PDF files. The results varied depending on how the PDF file was created, be it with Acrobat (Distiller), MS Word, pdflatex, …
None of the tested methods really preserved the formatting, but the main problem was the handling of non-ASCII characters, UTF-8 etc. The best method I found in this respect was PDFBox, an open source Apache project. If that doesn’t matter (as in this case, where Julien is looking for numbers), any method will do, e.g. getting the text from Preview or Skim, pdftotext or ASObjC.
It would be fairly easy to find all occurrences in the text (e.g. by using BBEdit or TextWrangler) and collect the found values, so the search can be done without using grep.
The problem is to find and *color* the respective hits in the PDF file.
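
The text-search half of this (not the coloring) can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the text has already been extracted by one of the tools mentioned (e.g. pdftotext), the helper name find_numbers is made up for illustration, and the number pattern is a guess at what "numbers" means here and would need adjusting to the actual data.

```python
import re

# Assumed pattern: runs of digits, optionally grouped by "." or ","
# (e.g. 17, 3.14, 1,234.50). Adjust to the real data.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def find_numbers(text):
    """Return a list of (offset, matched_string) for each number-like token.

    find_numbers is a hypothetical helper, not part of any tool in this thread.
    """
    return [(m.start(), m.group()) for m in NUMBER.finditer(text)]

# Stand-in for text extracted from a PDF.
sample = "Total: 1,234.50 EUR\nItems: 17\n"
for offset, value in find_numbers(sample):
    print(offset, value)
```

The offsets refer to the extracted string only. Whether they can be mapped back to positions on the PDF page depends on the extraction method, which is exactly the open question.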

> On 07.03.2017, at 00:05, Shane Stanley <email@hidden> wrote:
>
> That's true, but could be a good thing in this case. If you extract it this way page-by-page, you can find the point on the page where the text is by getting its index in the returned string, and then calling -characterBoundsAtIndex:. With that information, it should in theory be possible to use ASObjC to add a highlight at that point on the page.

This would be interesting, but I can’t see how you get correct character bounds with mangled text. And is there an ASObjC method to edit or annotate PDF files? As I mentioned, Skim might work, but I didn’t get around to testing that.

All the best
Thomas

> On 06.03.2017, at 21:45, Christopher Stone <email@hidden> wrote:
>
> On Mar 06, 2017, at 06:41, Yvan KOENIG <email@hidden> wrote:
>> No need for a third party tool to extract text from a PDF. This script delivered by Shane STANLEY does the job.
>
> That is indeed useful – IF you only need the RAW output.
>
> The ASObjC code returns RAW output and the layout of the pdf is significantly mangled.
>
> It's the same as pdftotext's -raw output.
>
> pdftotext -raw "/path/to/your/File.pdf" -
>
> The primary reason to use pdftotext instead of other tools is its ability to preserve the fidelity of the PDF file's layout.  (It's the only tool I personally know of that does this.)
>
> It's not perfect, but it can make parsing a PDF document's text relatively easy instead of difficult to impossible.
>
> pdftotext -layout "/path/to/your/File.pdf" -
>
> --
> Best Regards,
> Chris
>
> _______________________________________________
> Do not post admin requests to the list. They will be ignored.
> AppleScript-Users mailing list      (email@hidden)
> Help/Unsubscribe/Update your Subscription:
> Archives: http://lists.apple.com/archives/applescript-users
>
> This email sent to email@hidden
