Re: am i loading this pdf data correctly or not?
- Subject: Re: am i loading this pdf data correctly or not?
- From: Ben Dougall <email@hidden>
- Date: Thu, 7 Aug 2003 14:57:51 +0100
On Thursday, August 7, 2003, at 02:23 pm, Marcel Weiher wrote:
>>> Since the streams are random binary junk, you can't ignore them by
>>> parsing through them. After all, it is perfectly permissible for
>>> them to contain the character sequence "endstream".
>> at the moment i'm not in any way attempting to parse it with any pdf
>> semantics in mind
> Yes. That is the problem. PDF is very difficult to parse without
> keeping the semantics in mind. Really.
(the data could have 100 'endstream's in it and it wouldn't make any
difference, because i'm not looking out for endstream at all yet. i'm
just starting out, so these are initial steps.) the data that's
between stream and endstream contains something that messes up the
regular expression's operation: not my matching pattern, but the whole
operation, which simply stops. maybe there's a bug in the
regex i'm using? obviously regex doesn't care about pdf semantics.
something in this particular stream of data is causing the regex to
break / stop. i've described this to the person who wrote the regex
cocoa wrapper that i'm using; they were perplexed by the regex stopping
in the data part and asked me to send the code and the file i'm parsing,
which i did yesterday, so i'm waiting for the outcome of that.
seeing as my code did get all the pdf data into an NSString (maybe
incorrectly, as the data between stream and endstream looked like ...
\\001\\u03a98Vv\\u25ca^{\\371\\u220f\\2... after import, which is very
different to how it looks in the original pdf data), i don't think the
regex should be stopped by data like that. it may be incorrect data,
but that shouldn't make a jot of difference to the regex operation /
implementation itself. it should carry on through / past that.
> The problem is that NSString (and any Unicode compatible regex based on
> NSString) will attach semantics to character sequences.
NSString workings and details yes, but not pdf semantics (referring to
your first statement). not pdf semantics right now, in this specific
focussed point / problem that i'm talking about:
say i just wanted to use regex, for example, to count the number of
occurrences of the three characters 'obj' in some data (that just
*happens* to be from a pdf data file). pdf semantics are neither here
nor there. when doing this count the regex correctly counts the
occurrences of the pattern i'm searching for up to a certain point. it
then stops due to, i think, a particular block of data that's in the
NSString. the particular part of the NSString that makes the regex
stop looks like ... \\001\\u03a98Vv\\u25ca^{\\371\\u220f\\2... so this
would be a problem of my lack of understanding of NSStrings, because i'd
expect those characters not to cause a problem to NSString nor a regex.
in fact it doesn't seem to cause NSString a problem, only the regex.
the fine tuned to-the-point question is this. forget pdfs. why does
data in an NSString that looks like this :
... \\001\\u03a98Vv\\u25ca^{\\371\\u220f\\2...
not break an NSString (not break as in when i NSLog output it the whole
thing outputs - from start to finish), but break a regex parse (break
as in it stops prematurely)?
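
One way to take string encodings out of the picture for a count like this is to search the raw bytes directly instead of a decoded string. The following is a minimal sketch, again in Python for brevity; the function name and the sample data are made up for illustration, and it does not explain why the regex wrapper stops.

```python
def count_occurrences(data: bytes, needle: bytes) -> int:
    """Count non-overlapping occurrences of needle in raw bytes.

    No text decoding takes place, so arbitrary binary content
    (including NUL bytes) is just more bytes to scan past.
    """
    return data.count(needle)


# Hypothetical stand-in for PDF data with binary junk inside a stream.
sample = (
    b"1 0 obj\nstream\n"
    b"\x01\xbd8Vv\xd7^{\xf9\xb8\x02\x00"
    b"\nendstream\nendobj\n"
    b"2 0 obj\nendobj\n"
)

print(count_occurrences(sample, b"obj"))  # prints 4
```

The count includes the 'obj' inside each 'endobj', which matches the plain three-character search described above rather than anything PDF-aware.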