Re: am i loading this pdf data correctly or not?
- Subject: Re: am i loading this pdf data correctly or not?
- From: Ben Dougall <email@hidden>
- Date: Thu, 7 Aug 2003 13:33:56 +0100
On Wednesday, August 6, 2003, at 11:56 pm, Alastair J.Houghton wrote:
On Wednesday, August 6, 2003, at 11:25 pm, Ben Dougall wrote:
On Wednesday, August 6, 2003, at 03:46 pm, Marcel Weiher wrote:
No. The entire PDF file is a sequence of bytes, data. None of
those byte-sequences can be regarded as text. There may (or may
not) be text that is encoded in the PDF, but not in any way that you
can segment it on a purely syntactic level. Instead, you have to
parse/interpret the PDF (as data/bytes),
i realised that the streams in their raw form were not useable as
they were, but i didn't realise they would cause outright problems.
other than the streams, pdfs are ascii i think,
According to the PDF Reference Manual (which you can get for free from
Adobe's web site as a PDF), PDF is indeed 7-bit ASCII, apart from:
o Inside strings (things delimited by open and close parentheses).
o Inside streams (delimited by "stream" and "endstream").
So if you read it in token by token, you could indeed treat it as
ASCII apart from those two special cases. (Be careful though, there
is a bit of trickiness with strings: they can contain nested, balanced
parentheses as well as backslash escapes.)
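As a rough illustration of the nested-bracket trickiness mentioned above, here is a minimal Python sketch (Python rather than Cocoa, purely for brevity) that finds the end of a literal string. The function name skip_literal_string is hypothetical, and the sketch assumes the only special cases are balanced parentheses and backslash escapes.

```python
def skip_literal_string(data: bytes, start: int) -> int:
    """Return the index just past the ')' closing the string opened at start."""
    if data[start:start + 1] != b"(":
        raise ValueError("expected '(' at start")
    depth = 0
    i = start
    while i < len(data):
        c = data[i]
        if c == 0x5C:        # backslash: the next byte is escaped, skip both
            i += 2
            continue
        if c == 0x28:        # '(' opens the string or a nested level
            depth += 1
        elif c == 0x29:      # ')' closes one level
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated string")


data = b"(a (nested) \\) string) /Name"
end = skip_literal_string(data, 0)
print(data[:end])   # the whole string, including the nested and escaped brackets
```

Working on bytes rather than on decoded text means any non-ASCII bytes inside the string are simply stepped over.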
yes doing it step by step and skipping if necessary during that
stepping could be the answer. i'm actually thinking there may be
something wrong with the regex implementation i'm using so i'm waiting
for the outcome of that at the moment.
Why use regular expressions? That seems a bit strange given that PDF
is based on PostScript, which is a stack-based language much like Forth,
and Forth/PostScript is pretty easy to parse properly. When you
encounter a token that introduces a string or a stream, keep reading
until you find the end, then continue reading tokens again.
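The token-by-token approach described above might look something like the following Python sketch. It is not a real PDF parser: the names ascii_tokens and skip_string are hypothetical, literal strings are skipped rather than returned, and the end of a stream is found by searching for "endstream", which is an assumption that can fail if the binary data happens to contain those bytes (a real parser would use the stream's /Length entry).

```python
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def skip_string(data, i):
    """Return the index just past the literal string starting at data[i] == '('."""
    depth = 0
    while i < len(data):
        c = data[i]
        if c == 0x5C:                 # backslash escapes the next byte
            i += 2
            continue
        if c == 0x28:                 # '('
            depth += 1
        elif c == 0x29:               # ')'
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unterminated string")


def ascii_tokens(data):
    """Yield the ASCII tokens of data, skipping strings, comments and stream bodies."""
    i, n = 0, len(data)
    while i < n:
        c = data[i]
        if c in WHITESPACE:
            i += 1
        elif c == 0x25:               # '%' comment runs to the end of the line
            while i < n and data[i] not in b"\r\n":
                i += 1
        elif c == 0x28:               # '(' literal string: may hold non-ASCII bytes
            i = skip_string(data, i)
        elif data[i:i + 2] in (b"<<", b">>"):
            yield data[i:i + 2].decode("ascii")
            i += 2
        elif c == 0x3C:               # '<' hex string, itself plain ASCII
            end = data.find(b">", i)
            if end < 0:
                raise ValueError("unterminated hex string")
            yield data[i:end + 1].decode("ascii")
            i = end + 1
        elif c in b"[]{})>":          # remaining single-character delimiters
            yield chr(c)
            i += 1
        else:                         # name, number or keyword
            j = i + 1
            while j < n and data[j] not in WHITESPACE and data[j] not in DELIMITERS:
                j += 1
            token = data[i:j].decode("ascii")
            yield token
            i = j
            if token == "stream":     # jump over the raw bytes to "endstream"
                end = data.find(b"endstream", i)
                if end < 0:
                    raise ValueError("unterminated stream")
                i = end


sample = (b"1 0 obj\n<< /Length 4 /Title (a (b) \xff) >>\n"
          b"stream\n\x00\xff\x80\x01\nendstream\nendobj\n")
print(list(ascii_tokens(sample)))
```

Once the strings and stream bodies are out of the way, what remains is ordinary ASCII that can be matched or compared like any other text.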
because i know absolutely nothing about postscript or forth, and i
have had previous good success with regular expressions for parsing and
extracting information from text (not in cocoa though, and cocoa
and regex in cocoa are pretty new to me). i see no reason why regex
will not work fine with pdf data though (apart from some non-text data
parts unfortunately putting a spanner in regex's works and stopping it
dead at the moment)
thanks.
_______________________________________________
cocoa-dev mailing list | email@hidden