Re: am i loading this pdf data correctly or not?


Subject: Re: am i loading this pdf data correctly or not?
From: Ben Dougall <email@hidden>
Date: Wed, 6 Aug 2003 23:25:28 +0100

On Wednesday, August 6, 2003, at 03:46 pm, Marcel Weiher wrote:

On Wednesday, Aug 6, 2003, at 15:26 Europe/London, Ben Dougall wrote:

so pdfs are made up of both text data and non-text data? and non-text data should not be put in an NSString (that makes sense i suppose :) ).

No. The entire PDF file is a sequence of bytes, data. None of those byte-sequences can be regarded as text. There may (or may not) be text that is encoded in the PDF, but not in any way that you can segment it on a purely syntactic level. Instead, you have to parse/interpret the PDF (as data/bytes),

i realised that the streams in their raw form were not usable as they were, but i didn't realise they would cause outright problems. other than the streams, pdfs are ascii i think, or maybe an 8-bit char set. i was hoping / expecting the streams to be just ignored.

so parsing pdf data straight from the file with regular expressions is not on, full stop.

Yes. You need to at least take into account the binary streams that are embedded in the PDF document structure, in order to ignore those. For that, you have to parse the PDF document structure (via the xref table and the objects). Once you have that, you have PDF objects and binary streams. The PDF objects then tell you how you can parse the binary data streams to get at the actual contents of the PDF file.
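As a rough illustration of the first step described above (finding the xref table so that objects can be located by byte offset), here is a minimal Python sketch. It assumes a classic, uncompressed xref table reachable from the last `startxref` keyword; files that use cross-reference streams or incremental updates need more work. The function names `find_startxref` and `read_xref_table` are made up for this example.

```python
import re


def find_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    # The trailer lives at the end of the file, so only the tail is searched.
    tail = pdf_bytes[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("no startxref keyword found")
    return int(matches[-1])


def read_xref_table(pdf_bytes: bytes) -> dict:
    """Map object number -> byte offset for the in-use entries of a classic xref table."""
    pos = find_startxref(pdf_bytes)
    if not pdf_bytes.startswith(b"xref", pos):
        # Assumption: anything else (e.g. a cross-reference stream) is out of scope here.
        raise ValueError("no classic xref table at the startxref offset")
    lines = pdf_bytes[pos:].splitlines()
    offsets = {}
    i = 1  # lines[0] is the 'xref' keyword itself
    while i < len(lines):
        # Each subsection starts with "<first object number> <entry count>".
        header = re.fullmatch(rb"\s*(\d+)\s+(\d+)\s*", lines[i])
        if header is None:
            break  # reached 'trailer' or something unexpected
        first, count = int(header.group(1)), int(header.group(2))
        for k, line in enumerate(lines[i + 1 : i + 1 + count]):
            # Entry layout: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
            entry = re.match(rb"(\d{10}) (\d{5}) ([nf])", line)
            if entry and entry.group(3) == b"n":
                offsets[first + k] = int(entry.group(1))
        i += 1 + count
    return offsets
```

Everything here works on bytes, never on a decoded string, so the binary stream contents cannot trip up the text handling.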

(Virtually all relevant data is in those streams).

yes i realise that, but you need to look at the info about the pdf objects first before you go dealing with / uncompressing the streams - that was the part i was hoping to do with regexing through the pdf file.
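A small Python sketch of that "look at the object info first, then deal with the stream" order might look like the following. It assumes a single object whose dictionary carries a direct integer /Length and either no /Filter or a lone /FlateDecode filter; indirect lengths, filter arrays and other filters are deliberately not handled. `decode_stream_object` is a hypothetical helper name.

```python
import re
import zlib


def decode_stream_object(obj_bytes: bytes) -> bytes:
    """Read the dictionary of one stream object, then return its decoded data."""
    # Find the dictionary that directly precedes the 'stream' keyword.
    m = re.search(rb"<<(.*?)>>\s*stream\r?\n", obj_bytes, re.DOTALL)
    if m is None:
        raise ValueError("no stream dictionary found")
    dictionary = m.group(1)

    # Only a direct length such as '/Length 42' is accepted; an indirect
    # reference such as '/Length 12 0 R' would have to be resolved elsewhere.
    length = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R\b)", dictionary)
    if length is None:
        raise ValueError("this sketch needs a direct /Length entry")

    # The dictionary says how many bytes to take, so the data is never scanned.
    start = m.end()
    raw = obj_bytes[start : start + int(length.group(1))]

    if re.search(rb"/Filter\s*/FlateDecode\b", dictionary):
        return zlib.decompress(raw)
    if b"/Filter" in dictionary:
        raise NotImplementedError("only /FlateDecode is handled in this sketch")
    return raw
```

The point of the ordering is that the regular expressions only ever see the dictionary, while the stream bytes are sliced out by length and handed to the decoder untouched.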

in order to do that i'd have to first extract or block out or something the data (non-text data that is) so as to make sure that i do not give data (non-text) to the regular expression methods? i need to somehow parse the NSData first before regexing.

You need code that fully understands PDF, unless you only need some very specialized data that may be residing in the objects.

i'm only after certain parts of the pdf but i need to be able to get an overview of the whole pdf via its object stuff first. as i said, i was expecting the regex to just skip over and ignore the streams - that would have been fine, but it seems i have to physically strip out those streams first before i can go about parsing the whole file.
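For what it's worth, the "strip out the streams first" idea can be sketched in Python roughly as below. This is a heuristic, not a parser: it assumes the bytes `endstream` never occur inside stream data, which the format does not guarantee (a robust version would rely on each object's /Length). Stream data is overwritten with spaces rather than deleted, so byte offsets from the xref table stay valid. `blank_streams` and `list_objects` are names invented for this example.

```python
import re

# 'stream' + end-of-line, then the data, then 'endstream'. The lookbehind keeps
# the tail of an 'endstream' keyword from being taken as a new 'stream'.
STREAM_RE = re.compile(rb"(?<!end)(stream\r?\n)(.*?)(endstream)", re.DOTALL)


def blank_streams(pdf_bytes: bytes) -> bytes:
    """Overwrite stream data with spaces, keeping the total length unchanged."""

    def pad(match):
        return match.group(1) + b" " * len(match.group(2)) + match.group(3)

    return STREAM_RE.sub(pad, pdf_bytes)


def list_objects(pdf_bytes: bytes) -> list:
    """Return (object number, generation) pairs found outside stream data."""
    skeleton = blank_streams(pdf_bytes)
    found = re.findall(rb"(\d+)\s+(\d+)\s+obj\b", skeleton)
    return [(int(num), int(gen)) for num, gen in found]
```

With the streams blanked, a byte-level regular expression can scan the remaining object syntax for an overview without ever being fed the binary data.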

thanks for the info.
