Re: am i loading this pdf data correctly or not?
- Subject: Re: am i loading this pdf data correctly or not?
- From: Tom Sutcliffe <email@hidden>
- Date: Wed, 6 Aug 2003 01:00:33 +0100
Sorry if I'm being thick, but you shouldn't be dumping arbitrary binary
data into a string anyway, should you?
I can't help with the regex framework, but at a guess I'd say the
problem is one of encoding: you don't specify an encoding, so the init
method is guessing Unicode of some sort. Choosing a single-byte encoding
like Latin-1 will probably solve the problem. Unicode contains some
control codes which could well be confusing the framework, and you
don't want it to try to run bytes together either. Say the last byte of
the binary section (when interpreted as Unicode) says "I'm the first
byte of a 3-byte character" but it's followed by a byte "b" (when
interpreted as ASCII). The regex won't match against the "b" because it
thinks it's part of the 3-byte Unicode character.
Regards,
Tom
On Wednesday, August 6, 2003, at 12:04 am, Ben Dougall wrote:
i'm trying to parse the contents of a pdf file using a regex framework
called AGRegex. it works fine until unicode type characters appear,
then from then on it fails to get any of the expected matches. so the
regex stops dead as soon as some unicode characters appear in the pdf
data (which in actual fact was binary data rather than the unicode
representations shown below).
the regex framework is supposed to work fine with unicode (based on
pcre 4.0 - unicode compliant (if built correctly)). i think i've
either incorrectly built the regex framework, without unicode support,
or i'm incorrectly creating or setting up the string that gets passed
to the regex methods. here's the code that sets up the string from the
pdf:
NSArray *fileTypes = [NSArray arrayWithObject:@"pdf"];
NSOpenPanel *openPanel = [NSOpenPanel openPanel];
if ([openPanel runModalForDirectory:NSHomeDirectory() file:nil
                              types:fileTypes] == NSOKButton) {
    // load pdf file
    pdfData = [[NSString alloc] initWithContentsOfFile:
                  [[openPanel filenames] objectAtIndex:0]];
i print out the pdf data output using NSLog(@"%@", pdfData); and while
it's like...:
<<
/Type /Font
/Subtype /Type1
/Name /F0
/BaseFont /Times-Roman
/Encoding /MacRomanEncoding
>>
endobj
15 0 obj
...everything's fine (the regex gets the expected matches). as soon as
binary data occurs, which looks like this after it's been through my
above code...:
x\\u2044\\u2260W\\u20acr\\u2030\\u2202\\021\\u02dd\\307\\u02d8\\007<%\\
372\\u2018\\016\\345;\\001\\u03a98Vv\\u25ca^{\\371\\u220f\\250\\2518UR\
\036\\256!\\247\\260\\u2248!\\253C\\351\\02
...it goes wrong (fails to get matches thereafter, even once the
binary/unicode data stops and returns to resembling the first snippet).
just to make clear: the above second snippet of data is (was) binary
data and has been converted to that unicode style by my code that sets
the string up.
so have i done the pdfData NSString incorrectly?
_______________________________________________
cocoa-dev mailing list | email@hidden
Help/Unsubscribe/Archives:
http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored.