Re: am i loading this pdf data correctly or not?


  • Subject: Re: am i loading this pdf data correctly or not?
  • From: Ben Dougall <email@hidden>
  • Date: Thu, 7 Aug 2003 13:33:56 +0100

On Wednesday, August 6, 2003, at 11:56 pm, Alastair J.Houghton wrote:

On Wednesday, August 6, 2003, at 11:25 pm, Ben Dougall wrote:

On Wednesday, August 6, 2003, at 03:46 pm, Marcel Weiher wrote:

No. The entire PDF file is a sequence of bytes, data. None of those byte-sequences can be regarded as text. There may (or may not) be text that is encoded in the PDF, but not in any way that you can segment it on a purely syntactic level. Instead, you have to parse/interpret the PDF (as data/bytes),

i realised that the streams in their raw form were not useable as they were, but i didn't realise they would cause outright problems. other than the streams, pdfs are ascii i think,

According to the PDF Reference Manual (which you can get for free from Adobe's web site as a PDF), PDF is indeed 7-bit ASCII, apart from:

o Inside strings (things delimited by open and close brackets, i.e. "(" and ")").

o Inside streams (delimited by "stream" and "endstream").

So if you read it in token by token, you could indeed treat it as ASCII apart from those two special cases. (Be careful though, there is a bit of trickiness with strings and nested brackets.)
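The "trickiness with strings and nested brackets" can be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function name `skip_literal_string` is made up, and it assumes the usual PDF rules that balanced brackets may nest inside a string and that a backslash escapes the byte after it.

```python
def skip_literal_string(data: bytes, start: int) -> int:
    """Return the index just past the literal string opening at data[start].

    Hypothetical helper: tracks bracket nesting and skips escaped bytes,
    so a "\\)" inside the string does not close it early.
    """
    if data[start:start + 1] != b"(":
        raise ValueError("expected '(' at start")
    depth = 0
    i = start
    while i < len(data):
        c = data[i]
        if c == 0x5C:        # backslash: the following byte is escaped
            i += 2
            continue
        if c == 0x28:        # '(' opens a (possibly nested) level
            depth += 1
        elif c == 0x29:      # ')' closes one level
            depth -= 1
            if depth == 0:   # the outermost bracket has closed
                return i + 1
        i += 1
    raise ValueError("unterminated string")


sample = b"(a (nested) \\) paren) /Name"
end = skip_literal_string(sample, 0)
print(sample[:end])   # the whole string, including the nested part
print(sample[end:])   # whatever follows it
```

Working on bytes rather than decoded text sidesteps the problem of the string contents not being valid ASCII.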

yes doing it step by step and skipping if necessary during that stepping could be the answer. i'm actually thinking there may be something wrong with the regex implementation i'm using so i'm waiting for the outcome of that at the moment.


Why use regular expressions? That seems a bit strange given that PDF is based on PostScript, which is based on Forth, and Forth/PostScript is pretty easy to parse properly. When you encounter a token that introduces a string or a stream, keep reading until you find the end, then continue reading tokens again.
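A minimal sketch of that token-by-token approach, again in Python and under stated assumptions: the names `pdf_tokens`, `_string_end` and `_regular_end` are invented for illustration, and the stream handling simply searches for the next "endstream" keyword, which is a simplification (a real reader would normally use the stream's /Length entry, since the binary data could itself contain those bytes).

```python
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def _string_end(data, i):
    """Index just past the bracketed string starting at data[i]."""
    depth = 0
    while i < len(data):
        c = data[i]
        if c == 0x5C:            # backslash escapes the next byte
            i += 2
            continue
        if c == 0x28:            # '('
            depth += 1
        elif c == 0x29:          # ')'
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated string")


def _regular_end(data, i):
    """Index of the first whitespace or delimiter byte at or after i."""
    while i < len(data) and data[i] not in WHITESPACE and data[i] not in DELIMITERS:
        i += 1
    return i


def pdf_tokens(data):
    """Yield (kind, raw_bytes); kind is 'token', 'string' or 'stream'."""
    i, n = 0, len(data)
    while i < n:
        c = data[i]
        if c in WHITESPACE:
            i += 1
        elif c == 0x25:                          # '%' comment runs to end of line
            while i < n and data[i] not in b"\r\n":
                i += 1
        elif c == 0x28:                          # '(' literal string
            end = _string_end(data, i)
            yield "string", data[i:end]
            i = end
        elif data[i:i + 2] in (b"<<", b">>"):    # dictionary brackets
            yield "token", data[i:i + 2]
            i += 2
        elif c == 0x3C:                          # '<' hex string, ends at '>'
            end = data.find(b">", i)
            if end < 0:
                raise ValueError("unterminated hex string")
            yield "string", data[i:end + 1]
            i = end + 1
        elif c == 0x2F:                          # '/' starts a name
            end = _regular_end(data, i + 1)
            yield "token", data[i:end]
            i = end
        elif c in DELIMITERS:                    # '[', ']', '{', '}' and strays
            yield "token", data[i:i + 1]
            i += 1
        else:                                    # number or keyword
            end = _regular_end(data, i)
            word = data[i:end]
            if word == b"stream":
                # Simplification: scan for the keyword instead of using /Length.
                close = data.find(b"endstream", end)
                if close < 0:
                    raise ValueError("unterminated stream")
                yield "stream", data[end:close]
                i = close + len(b"endstream")
            else:
                yield "token", word
                i = end
```

Only the items yielded as 'token' are safe to treat as ASCII; 'string' and 'stream' payloads stay as raw bytes and can be skipped or decoded separately.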

because i know absolutely nothing about PostScript or Forth, and i have had previous good success with regular expressions for parsing and extracting information from text (not in cocoa though, and cocoa and regex in cocoa are pretty new to me). i see no reason why regex will not work fine with pdf data though (apart from some non-text data parts unfortunately putting a spanner in regex's works and stopping it dead at the moment)

thanks.
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.
