Re: reading a PDF
- Subject: Re: reading a PDF
- From: Ricky Sharp <email@hidden>
- Date: Sat, 29 Nov 2008 15:43:19 -0600
On Nov 29, 2008, at 3:16 PM, Torsten Curdt wrote:
I just assume that the actual content is hidden inside the page's
content stream(s).
Raw content, mostly, sometimes. But the draw commands are what put it
all together.
For instance, you might have a paragraph of text where there is one draw
command per line, or you might have a paragraph of text where there is
one draw command per character.
Getting to the individual draw commands for the text/characters would
be a first step ...and maybe even enough for what I am after. Is this
what the CGPDFOperatorTableSetCallback() is for?
For an image that fills the page, you might have one content stream and
one draw command, or you might have multiple image slices with one
content stream and one draw command for each slice.
Would a PDF writer really slice the images up?
IOW, what you want is not so simple.
I see.
Well, I probably don't really need the image extraction.
Just getting the text draw commands might suffice.
At my day job, we use pdfbox (see www.pdfbox.org) in automated tests.
It basically grabs the raw textual data and spits out two-dimensional
arrays of strings.
While it's Java-based, it may shed some light on how text extraction can
be done. I do not know, however, whether their licensing model will fit
your needs (i.e. whether basing your code on theirs is even allowed).
There are some links on their site (http://www.pdfbox.org/references.html)
that show how someone wrote a Cocoa app and used the Java bridge to
interface with pdfbox.
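To make the "one draw command per line" versus "one draw command per
character" point concrete, here is a rough Python sketch that walks a
simplified content stream and collects the string shown by each text
draw operator (Tj, TJ, ' and "). The function name text_draw_commands
and the sample streams are made up for illustration. It assumes an
uncompressed stream with plain literal strings; real PDFs add stream
filters, hex strings, nested parentheses and font encodings that this
does not handle, and it is not how pdfbox or the CGPDF scanner work
internally.

```python
import re

# Tokens: a literal string "(...)" with backslash escapes, an array
# bracket, or a bare word (operator, number, or /Name).
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|\[|\]|[^\s\[\]()]+")

# Bare words made only of these characters are treated as operators.
OPERATOR = re.compile(r"[A-Za-z'\"*]+")

# The PDF operators that actually paint text.
SHOW_OPS = {"Tj", "TJ", "'", '"'}


def unescape(literal):
    """Drop the outer parentheses; resolve only \\( \\) and \\\\ escapes."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def text_draw_commands(stream):
    """Return the text shown by each text draw command, in stream order."""
    shown = []
    operands = []  # string operands seen since the last operator
    for tok in TOKEN.findall(stream):
        if tok.startswith("("):
            operands.append(unescape(tok))
        elif OPERATOR.fullmatch(tok):
            if tok in SHOW_OPS:
                # A TJ array may hold several strings; join them.
                shown.append("".join(operands))
            operands = []  # every operator consumes its operands
        # Numbers, /Names and brackets are ignored in this sketch.
    return shown


# Hypothetical streams: same kind of text, different granularity.
per_line = "BT /F1 12 Tf 72 700 Td (Hello PDF) Tj ET"
per_char = "BT /F1 12 Tf 72 700 Td (H) Tj 9 0 Td (i) Tj 4 0 Td (!) Tj ET"
kerned = "BT /F1 12 Tf [(Hel) -20 (lo)] TJ ET"

print(text_draw_commands(per_line))  # one command for the whole line
print(text_draw_commands(per_char))  # one command per character
print(text_draw_commands(kerned))    # one TJ command, two string pieces
```

Even in this toy form, the per-character stream comes back as separate
one-letter strings, so rebuilding words and lines means also tracking
the positioning operators (Td, Tm and friends) between the draw commands.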
___________________________________________________________
Ricky A. Sharp mailto:email@hidden
Instant Interactive(tm) http://www.instantinteractive.com