Re: reading a PDF

  • Subject: Re: reading a PDF
  • From: Ricky Sharp <email@hidden>
  • Date: Sat, 29 Nov 2008 15:43:19 -0600


On Nov 29, 2008, at 3:16 PM, Torsten Curdt wrote:

I just assume that the actual content is hidden inside the page's
content stream(s).

Raw content, mostly, sometimes. But the draw commands are what put it all
together.


For instance, you might have a paragraph of text where there is one draw
command per line, or you might have a paragraph of text where there is one
draw command per character.
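
To make the per-line versus per-character point concrete, here is a minimal sketch in Python. It is not the CoreGraphics scanner API; it is a toy tokenizer that assumes an already-decompressed content stream, literal (parenthesized) strings only, and only the simplest backslash escapes. The function name text_draw_commands and the sample streams are made up for illustration.

```python
import re

# A literal string "(...)" with backslash escapes, an array bracket,
# or any other run of non-delimiter characters (numbers, names, operators).
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|\[|\]|[^\s\[\]()]+")
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def unescape(literal):
    """Strip the outer parentheses and resolve \\( \\) \\\\ style escapes.

    Assumption: octal and control escapes such as \\n are not handled.
    """
    return re.sub(r"\\(.)", r"\1", literal[1:-1])


def is_operand(token):
    """Operands are strings, array brackets, names (/F1) and numbers."""
    return token[0] in "([]/" or NUMBER.fullmatch(token) is not None


def text_draw_commands(stream):
    """Return the text shown by each Tj or TJ operator, one entry per command."""
    commands = []
    operands = []
    for token in TOKEN.findall(stream):
        if is_operand(token):
            operands.append(token)
            continue
        if token in ("Tj", "TJ"):
            # TJ takes an array mixing strings and kerning numbers;
            # only the strings carry text.
            pieces = [unescape(t) for t in operands if t.startswith("(")]
            commands.append("".join(pieces))
        operands = []  # every operator consumes its operands
    return commands


# The same kind of text, drawn with one command per line or one per character.
per_line = "BT /F1 12 Tf 72 700 Td (Hello world) Tj ET"
per_char = "BT /F1 12 Tf 72 700 Td (H) Tj 7 0 Td (i) Tj ET"
```

Calling text_draw_commands(per_line) yields a single entry, while per_char yields one entry per character, so any reassembly into words or lines has to happen after the draw commands have been collected.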

Getting to the individual draw commands for the text/characters would be a first step, and maybe even enough for what I am after. Is this what CGPDFOperatorTableSetCallback() is for?

For an image that fills the page, you might have one
content stream and one draw command, or you might have multiple image slices
with one content stream and one draw command for each slice.

Would a PDF writer really slice the images up?

IOW, what you want is not so simple.

I see.

Well, I probably don't really need the image extraction.
Just getting the text draw commands might suffice.


At my day job, we use pdfbox (see www.pdfbox.org) in automated tests. It basically grabs raw textual data and spits out two-dimensional arrays of strings.
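
As a rough illustration of how positioned text fragments could end up as a two-dimensional array of strings, here is a small Python sketch. It is not pdfbox's algorithm or API; the (x, y, text) fragment format, the function name rows_of_strings, and the baseline tolerance are all assumptions made for the example.

```python
def rows_of_strings(fragments, tolerance=2.0):
    """Group (x, y, text) fragments into rows, each sorted left to right.

    Assumptions: y grows upward as in PDF user space, and fragments whose
    baselines differ by at most `tolerance` belong to the same row.
    """
    rows = []  # each entry is [baseline_y, [(x, text), ...]]
    # Visit fragments from the top of the page downward.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within each row, order the cells by x position and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments, deliberately out of reading order.
fragments = [
    (200, 700, "Qty"),
    (72, 700, "Item"),
    (72, 680.5, "Widget"),
    (200, 681, "3"),
]
table = rows_of_strings(fragments)
```

With the sample fragments above, table comes out as two rows of two strings each, which is a convenient shape to compare against expected values in an automated test.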

While it's Java-based, it may shed some light on how text extraction can be done. I do not, however, know if their licensing model will fit your needs (i.e. whether you are even allowed to base your code on theirs).

There are some links on their site (http://www.pdfbox.org/references.html) that show how someone wrote a Cocoa app and used the Java bridge to interface with pdfbox.

___________________________________________________________
Ricky A. Sharp         mailto:email@hidden
Instant Interactive(tm)   http://www.instantinteractive.com



_______________________________________________

Cocoa-dev mailing list (email@hidden)
