Re: am i loading this pdf data correctly or not?
- Subject: Re: am i loading this pdf data correctly or not?
- From: Ben Dougall <email@hidden>
- Date: Wed, 6 Aug 2003 23:25:28 +0100
On Wednesday, August 6, 2003, at 03:46 pm, Marcel Weiher wrote:
On Wednesday, Aug 6, 2003, at 15:26 Europe/London, Ben Dougall wrote:
so pdfs are made up of both text data and non-text data? and non-text
data should not be put in an NSString (that makes sense i suppose :) ).
No. The entire PDF file is a sequence of bytes, data. None of those
byte-sequences can be regarded as text. There may (or may not) be
text that is encoded in the PDF, but not in any way that you can
segment it on a purely syntactic level. Instead, you have to
parse/interpret the PDF (as data/bytes),
i realised that the streams in their raw form were not useable as they
were, but i didn't realise they would cause outright problems. other
than the streams, pdfs are ascii i think, or maybe an 8-bit char set. i
was hoping / expecting the streams to be just ignored.
so parsing pdf data straight from the file with regular expressions
is not on, full stop.
Yes. You need to at least take into account the binary streams that
are embedded in the PDF document structure, in order to ignore those.
For that, you have to parse the PDF document structure (via the xref
table and the objects). Once you have that, you have PDF objects and
binary streams. The PDF objects then tell you how you can parse the
binary data streams to get at the actual contents of the PDF file.
(Virtually all relevant data is in those streams).
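As a rough illustration of "via the xref table": a PDF with a classic cross-reference table ends with a startxref keyword, a byte offset, and %%EOF, and that offset points at the xref keyword. The sketch below is in Python rather than Cocoa, purely for brevity. The function name find_xref_offset and the tiny hand-built sample are hypothetical, and it assumes a single classic xref table (no incremental updates, no cross-reference streams).

```python
import re

def find_xref_offset(pdf_bytes: bytes) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    # The trailer sits at the end of the file, so only the tail is searched.
    tail = pdf_bytes[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref marker found")
    return int(matches[-1])

# Hand-built sample, just large enough to exercise the function.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_pos = len(body)  # the xref table starts right after the body
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
    + str(xref_pos).encode("ascii")
    + b"\n%%EOF\n"
)

offset = find_xref_offset(sample)
# The bytes at that offset should be the 'xref' keyword itself.
keyword = sample[offset:offset + 4]
```

Everything here works on bytes, never on a decoded string, which is the point being made above: the offsets in the xref table are byte offsets into the raw file.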
yes i realise that, but you need to look at the info about the pdf
objects first before you go dealing with / uncompressing the streams -
that was the part i was hoping to do with regexing through the pdf file.
in order to do that i'd have to first extract or block out or
something the data (non-text data that is) so as to make sure that i
do not give data (non-text) to the regular expression methods? i need
to somehow parse the NSData first before regexing.
You need code that fully understands PDF, unless you only need some
very specialized data that may be residing in the objects.
i'm only after certain parts of the pdf but i need to be able to get an
overview of the whole pdf via its object stuff first. as i said, i
was expecting the regex to just skip over and ignore the streams - that
would have been fine, but it seems i have to physically strip out those
streams first before i can go about parsing the whole file.
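A minimal sketch of that "strip the streams first" step, in Python for brevity rather than Cocoa. It runs a regex over the raw bytes (not over a string) to empty every stream body, then decodes what is left as Latin-1 so ordinary text regexes can run on it. The names STREAM_RE and blank_streams and the sample bytes are hypothetical. It is a heuristic only: compressed data can contain the bytes "endstream" by chance, so a robust version would use each object's /Length entry instead.

```python
import re

# 'stream' + end-of-line, then anything up to the first 'endstream'.
# The lookbehind stops the 'stream' inside 'endstream' from matching.
STREAM_RE = re.compile(rb"((?<!end)stream\r?\n).*?(endstream)", re.DOTALL)

def blank_streams(pdf_bytes: bytes) -> bytes:
    """Return a copy of the PDF bytes with every stream body emptied."""
    return STREAM_RE.sub(rb"\1\2", pdf_bytes)

# Hand-built sample: one object with a binary stream, one without.
sample = (
    b"1 0 obj\n<< /Length 6 >>\nstream\n"
    b"\x00\xff\x80\x9c\x01\x02\nendstream\nendobj\n"
    b"2 0 obj\n<< /Type /Page >>\nendobj\n"
)

cleaned = blank_streams(sample)
# Latin-1 maps every byte value to a character, so decoding cannot fail.
text = cleaned.decode("latin-1")
object_headers = re.findall(r"(\d+) (\d+) obj", text)
```

Note that the cleaned copy is only good for reading the object structure. Byte offsets from the xref table no longer line up with it, so anything located this way has to be looked up again in the original data.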
thanks for the info.