Re: am i loading this pdf data correctly or not?
- Subject: Re: am i loading this pdf data correctly or not?
- From: "Alastair J.Houghton" <email@hidden>
- Date: Wed, 6 Aug 2003 23:56:11 +0100
On Wednesday, August 6, 2003, at 11:25 pm, Ben Dougall wrote:
> On Wednesday, August 6, 2003, at 03:46 pm, Marcel Weiher wrote:
>> No. The entire PDF file is a sequence of bytes, data. None of those
>> byte-sequences can be regarded as text. There may (or may not) be
>> text that is encoded in the PDF, but not in any way that you can
>> segment it on a purely syntactic level. Instead, you have to
>> parse/interpret the PDF (as data/bytes),
> i realised that the streams in their raw form were not useable as they
> were, but i didn't realise they would cause outright problems. other
> than the streams, pdfs are ascii i think,
According to the PDF Reference Manual (which you can get for free from
Adobe's web site as a PDF), PDF is indeed 7-bit ASCII, apart from:
o Inside literal strings (delimited by open and close parentheses).
o Inside streams (delimited by "stream" and "endstream").
So if you read it in token by token, you could indeed treat it as ASCII
apart from those two special cases. (Be careful though, there is a bit
of trickiness with strings: parentheses can nest inside them, and a
backslash escapes the character that follows it.)
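
To make the nested-parentheses point concrete, here is a minimal sketch in Python (chosen only for brevity; nothing in this thread uses it). The function name read_literal_string is hypothetical. It assumes the whole file is already in memory as bytes, and it returns the raw string contents without decoding any escape sequences.

```python
def read_literal_string(data, pos):
    """Return (raw_contents, index_after_closing_paren) for the string at pos.

    Hypothetical helper: expects data[pos] to be the opening '(' and leaves
    backslash escapes undecoded.
    """
    if data[pos:pos + 1] != b"(":
        raise ValueError("expected '(' at position %d" % pos)
    depth = 1
    i = pos + 1
    while i < len(data):
        c = data[i]
        if c == 0x5C:        # backslash: skip the escaped byte, whatever it is
            i += 2
            continue
        if c == 0x28:        # '(' opens a nested level
            depth += 1
        elif c == 0x29:      # ')' closes one level
            depth -= 1
            if depth == 0:
                return data[pos + 1:i], i + 1
        i += 1
    raise ValueError("unterminated string")
```

Balanced parentheses inside the string just change the depth counter, while an escaped parenthesis is skipped and does not count.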
I have also noticed that some files have 8-bit characters (and even
binary data) in comments, so although my quick skim of the reference
manual just now didn't reveal any obvious statement that that was
permissible, I'd take it as read that you need to support them in
comments as well.
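
One way to cope with 8-bit bytes in comments, sketched here in Python under the assumption that the file has been read as bytes, is to never decode comments as ASCII. The helper name decode_comment is made up for illustration; Latin-1 is used only because it maps every byte value to a character, not because the PDF Reference prescribes it.

```python
def decode_comment(raw):
    """Turn comment bytes into text for display without ever failing.

    Latin-1 maps each byte 0-255 to one code point, so this cannot raise,
    whereas raw.decode("ascii") raises on any byte above 0x7F.
    """
    return raw.decode("latin-1")
```

Encoding the result back with "latin-1" gives the original bytes, so nothing is lost if the comment has to be written out again.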
[snip]
> you need to look at the info about the pdf objects first before you go
> dealing with / uncompressing the streams - that was the part i was
> hoping to do with regexing through pdf file.
Why use regular expressions? That seems a bit strange given that PDF
is based on PostScript, which is a stack-based language much like
Forth, and that kind of syntax is pretty easy to parse properly. When
you encounter a token that introduces a string or a stream, keep
reading until you find the end (for a stream, the /Length entry in its
dictionary tells you how many bytes to read), then continue reading
tokens again.
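
As a rough illustration of that token-by-token approach, here is a sketch in Python. Everything in it is an assumption made for the example: the name tokenize, the four token kinds, and in particular the shortcut of searching for "endstream" instead of honouring /Length, which a real reader should not rely on.

```python
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"

def tokenize(data):
    """Yield (kind, value); kind is 'token', 'string', 'comment' or 'stream'."""
    i, n = 0, len(data)
    while i < n:
        c = data[i:i + 1]
        if c in WHITESPACE:
            i += 1
        elif c == b"%":                      # comment runs to end of line
            end = i
            while end < n and data[end] not in b"\r\n":
                end += 1
            yield "comment", data[i:end]
            i = end
        elif c == b"(":                      # literal string, may nest
            depth, j = 1, i + 1
            while j < n and depth:
                if data[j] == 0x5C:          # backslash: skip escaped byte
                    j += 1
                elif data[j] == 0x28:
                    depth += 1
                elif data[j] == 0x29:
                    depth -= 1
                j += 1
            if depth:
                raise ValueError("unterminated string")
            yield "string", data[i + 1:j - 1]
            i = j
        elif data[i:i + 2] in (b"<<", b">>"):  # dictionary brackets
            yield "token", data[i:i + 2]
            i += 2
        elif c == b"<":                      # hex string, kept with its <>
            j = data.find(b">", i)
            if j < 0:
                raise ValueError("unterminated hex string")
            yield "string", data[i:j + 1]
            i = j + 1
        elif c in b"[]{}>)":                 # single-character delimiters
            yield "token", c
            i += 1
        else:                                # name, number or keyword
            j = i + 1
            while (j < n and data[j] not in WHITESPACE
                   and data[j] not in DELIMITERS):
                j += 1
            word, i = data[i:j], j
            if word != b"stream":
                yield "token", word
                continue
            # Skip the line ending after the keyword, then take raw bytes.
            if data[i:i + 2] == b"\r\n":
                i += 2
            elif data[i:i + 1] == b"\n":
                i += 1
            end = data.find(b"endstream", i)  # simplification, see above
            if end < 0:
                raise ValueError("unterminated stream")
            yield "stream", data[i:end]
            i = end + len(b"endstream")
```

The stream bytes come out untouched (including the line ending before "endstream"), so the dictionary tokens that precede them can be inspected first and the data decompressed afterwards.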
Kind regards,
Alastair.