Re: am i loading this pdf data correctly or not?

Subject: Re: am i loading this pdf data correctly or not?
From: "Alastair J.Houghton" <email@hidden>
Date: Wed, 6 Aug 2003 23:56:11 +0100

On Wednesday, August 6, 2003, at 11:25 pm, Ben Dougall wrote:

On Wednesday, August 6, 2003, at 03:46 pm, Marcel Weiher wrote:

No. The entire PDF file is a sequence of bytes, data. None of those byte-sequences can be regarded as text. There may (or may not) be text that is encoded in the PDF, but not in any way that you can segment it on a purely syntactic level. Instead, you have to parse/interpret the PDF (as data/bytes),

i realised that the streams in their raw form were not useable as they were, but i didn't realise they would cause outright problems. other than the streams, pdfs are ascii i think,

According to the PDF Reference Manual (which you can get for free from Adobe's web site as a PDF), PDF is indeed 7-bit ASCII, apart from:

o Inside literal strings (things delimited by parentheses, "(" and ")").

o Inside streams (delimited by "stream" and "endstream").

So if you read it in token by token, you could indeed treat it as ASCII apart from those two special cases. (Be careful though: a literal string may contain nested, balanced parentheses as well as backslash escapes, so the first ")" you see is not necessarily the end of the string.)
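To make the nesting point concrete, here is a minimal sketch in Python of finding the end of a literal string. The function name read_literal_string is made up for illustration, and it only locates the closing parenthesis; it does not decode escape sequences.

```python
from typing import Tuple


def read_literal_string(data: bytes, pos: int) -> Tuple[bytes, int]:
    """Return (raw body, index just past the closing parenthesis).

    data[pos] must be the opening '(' of a PDF literal string.
    Escapes are skipped over, not decoded.
    """
    if data[pos:pos + 1] != b"(":
        raise ValueError("expected '(' at start of literal string")
    depth = 1
    i = pos + 1
    while i < len(data):
        c = data[i]
        if c == 0x5C:    # backslash: the next byte is escaped, skip both
            i += 2
            continue
        if c == 0x28:    # '(' opens a nested level
            depth += 1
        elif c == 0x29:  # ')' closes a level; depth 0 ends the string
            depth -= 1
            if depth == 0:
                return data[pos + 1:i], i + 1
        i += 1
    raise ValueError("unterminated literal string")
```

Working on bytes rather than on a decoded string means 8-bit characters inside the string pass through untouched.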

I have also noticed that some files have 8-bit characters (and even binary data) in comments, so although my quick skim of the reference manual just now didn't reveal any obvious statement that that was permissible, I'd take it as read that you need to support them in comments as well.
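One way to cope with that is never to decode comments at all and simply skip them as raw bytes. A rough sketch in Python follows; the helper name is hypothetical, and the whitespace set is the one I remember from the reference manual (NUL, tab, LF, FF, CR, space).

```python
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_whitespace_and_comments(data: bytes, pos: int) -> int:
    """Return the index of the next byte that starts a real token."""
    n = len(data)
    while pos < n:
        c = data[pos]
        if c in PDF_WHITESPACE:
            pos += 1
        elif c == 0x25:  # '%': comment runs to end of line, any byte allowed
            while pos < n and data[pos] not in b"\r\n":
                pos += 1
        else:
            break
    return pos
```

Because nothing inside the comment is interpreted, high-bit or binary bytes there cannot upset the rest of the parse.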

[snip]

you need to look at the info about the pdf objects first before you go dealing with / uncompressing the streams - that was the part i was hoping to do with regexing through pdf file.

Why use regular expressions? That seems a bit strange given that PDF is based on PostScript, which is a stack-based language much like Forth, and Forth/PostScript is pretty easy to parse properly. When you encounter a token that introduces a string or a stream, keep reading until you find the end, then continue reading tokens again.
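For the stream case, a sketch along these lines might do (Python; read_stream is a made-up name). It assumes you have already parsed the dictionary in front of the stream and, where possible, pass in its /Length value. Searching for "endstream" is only a fallback, since binary stream data can happen to contain those bytes.

```python
from typing import Optional, Tuple


def read_stream(data: bytes, pos: int,
                length: Optional[int] = None) -> Tuple[bytes, int]:
    """Return (payload, index just past the 'endstream' keyword).

    data[pos:] must begin with the keyword 'stream'. If length is None
    the payload is everything up to the first 'endstream', which also
    includes any end-of-line bytes that precede that keyword.
    """
    if not data.startswith(b"stream", pos):
        raise ValueError("expected 'stream' keyword")
    i = pos + len(b"stream")
    # The keyword is followed by CRLF or LF before the data begins.
    if data.startswith(b"\r\n", i):
        i += 2
    elif data.startswith(b"\n", i):
        i += 1
    if length is not None:
        end = i + length                   # trust the dictionary's /Length
    else:
        end = data.find(b"endstream", i)   # fallback: scan for the keyword
        if end < 0:
            raise ValueError("unterminated stream")
    close = data.find(b"endstream", end)
    if close < 0:
        raise ValueError("missing 'endstream' keyword")
    return data[i:end], close + len(b"endstream")
```

After this returns, you carry on reading ordinary tokens from the returned index.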

Kind regards,

Alastair.