Re: NSXMLParser and character entities?
- Subject: Re: NSXMLParser and character entities?
- From: Kai <email@hidden>
- Date: Mon, 15 Sep 2008 21:45:53 +0200
On 14.9.2008, at 10:45, Nathan Kinsinger wrote:
On Sep 12, 2008, at 3:56 PM, Kai wrote:
When NSXMLParser hits a character entity like &auml; (-> German
umlaut 'ä'), it sends parser:resolveExternalEntityName:systemID: to
its delegate and if this is not implemented or returns nil,
parser:parseErrorOccurred: is called with
NSXMLParserUndeclaredEntityError.
Am I supposed to resolve all these character entities myself? And
if so, what should the NSData object returned by
parser:resolveExternalEntityName:systemID: contain? Unicode? Which
Unicode encoding?
But this can’t be, can it? I must be missing something simple.
Thanks for any hints
Kai
The main problem is that entities like &auml; are defined by HTML
and have nothing to do with XML or NSXMLParser.
Understood.
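For what it's worth, XML itself predefines only five named entities (&amp;, &lt;, &gt;, &apos;, &quot;), but numeric character references such as &#228; or &#xE4; need no declaration at all. A minimal sketch of that, assuming ARC and the formal NSXMLParserDelegate protocol (10.6 and later; on 10.5 the delegate is informal). TextCollector and CollectText are hypothetical names, not part of Foundation:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper class: gathers all character data the parser reports.
@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@property (nonatomic, strong) NSError *error;
@end

@implementation TextCollector
- (instancetype)init {
    if ((self = [super init])) {
        _text = [NSMutableString string];
    }
    return self;
}

// May be called several times per element, so append rather than assign.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}

- (void)parser:(NSXMLParser *)parser parseErrorOccurred:(NSError *)parseError {
    self.error = parseError;
}
@end

// Hypothetical helper: returns the concatenated character data of the
// document, or nil if the parser reported any error.
static NSString *CollectText(NSString *xml) {
    NSData *data = [xml dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    TextCollector *collector = [[TextCollector alloc] init];
    parser.delegate = collector;
    BOOL ok = [parser parse];
    return (ok && collector.error == nil) ? [collector.text copy] : nil;
}
```

Calling CollectText(@"<word>M&#228;dchen &amp; B&#xE4;r</word>") should hand back the text with both umlauts and the ampersand expanded, without parser:resolveExternalEntityName:systemID: being involved.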
I haven't dealt with this problem myself but I was curious so I
tried a few things.
My first attempt was using NSAttributedString to convert the HTML
entity to a UTF-8 string.

- (NSData *)parser:(NSXMLParser *)parser
    resolveExternalEntityName:(NSString *)entityName
    systemID:(NSString *)systemID
{
    // Let the HTML importer decode the named entity, e.g. "&auml;" -> "ä".
    NSString *html = [NSString stringWithFormat:@"&%@;", entityName];
    NSAttributedString *entityString = [[[NSAttributedString alloc]
        initWithHTML:[html dataUsingEncoding:NSUTF8StringEncoding]
        documentAttributes:NULL] autorelease];
    NSLog(@"resolved entity name: %@", [entityString string]);
    // Hand the replacement text back to the parser as UTF-8 bytes.
    return [[entityString string] dataUsingEncoding:NSUTF8StringEncoding];
}

This works: parser:foundCharacters: gets the ä, but for some reason
parser:parseErrorOccurred: is still called with the same error
you received: "Operation could not be completed.
(NSXMLParserErrorDomain error 26.)"
The parser does continue and parse the file correctly (with the ä),
it just makes it hard to tell when you have real errors. I'm really
curious as to why this doesn't work (running 10.5.4 on Intel). And
the fact that the parser keeps parsing after the error, when the
documentation says it will stop, is odd too.
That’s indeed odd behavior. Guess I’d better file a bug.
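Until that is sorted out, one conceivable stopgap is to filter the spurious error in the delegate so that real errors still stand out. This is only a sketch under assumptions: it presumes ARC and the formal NSXMLParserDelegate protocol (10.6 and later), LenientDelegate is a hypothetical name, and it will also hide genuine undeclared-entity errors, so it only makes sense together with a resolver like the one above:

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate: counts "undeclared entity" errors separately and
// remembers only the first error of any other kind.
@interface LenientDelegate : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSError *realError;
@property (nonatomic) NSUInteger ignoredEntityErrors;
@end

@implementation LenientDelegate
- (void)parser:(NSXMLParser *)parser parseErrorOccurred:(NSError *)parseError {
    // NSXMLParserUndeclaredEntityError is the "error 26" quoted in this thread.
    BOOL undeclaredEntity =
        [parseError.domain isEqualToString:NSXMLParserErrorDomain] &&
        parseError.code == NSXMLParserUndeclaredEntityError;
    if (undeclaredEntity) {
        self.ignoredEntityErrors += 1;
    } else if (self.realError == nil) {
        self.realError = parseError;
    }
}
@end
```

After parsing, realError being nil would then mean "nothing went wrong apart from entity complaints", and ignoredEntityErrors tells you how many of those were swallowed.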
Another option is to add an XHTML DOCTYPE declaration to the file and
call setShouldResolveExternalEntities: with YES (the default is NO). This
works with no errors because the DTD defines the entities.
However, NSXMLParser will download the DTD (over the net) every time
you parse a file. So you probably want to copy one of the DTDs (say
http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd) locally. Although I
didn't try it, you could copy the entity definitions into your own DTD
to make the file smaller and parsing it faster.
That should work well, thanks for the hint.
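The idea of shipping only the entity definitions can also be tried without a separate DTD file, by declaring them in the document's internal subset. A rough sketch, not verified against NSXMLParser on 10.5: DocumentWithGermanEntities is a hypothetical helper, and that the parser expands entities declared this way is an assumption to check. The numeric values are the ones used by the XHTML Latin-1 entity set:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: prefixes a document body with an XML declaration and a
// DOCTYPE whose internal subset declares the German named entities.
// rootName must match the name of the root element in body.
static NSString *DocumentWithGermanEntities(NSString *rootName, NSString *body) {
    NSString *subset =
        @"  <!ENTITY Auml \"&#196;\">\n"
        @"  <!ENTITY Ouml \"&#214;\">\n"
        @"  <!ENTITY Uuml \"&#220;\">\n"
        @"  <!ENTITY szlig \"&#223;\">\n"
        @"  <!ENTITY auml \"&#228;\">\n"
        @"  <!ENTITY ouml \"&#246;\">\n"
        @"  <!ENTITY uuml \"&#252;\">\n";
    return [NSString stringWithFormat:
        @"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE %@ [\n%@]>\n%@",
        rootName, subset, body];
}
```

For example, DocumentWithGermanEntities(@"word", @"<word>&auml;</word>") produces a document that carries its own definition of &auml;, so nothing has to be fetched over the net.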
Of course, if the content really is XHTML, you should be using
an HTML parser and not an XML one.
No, it isn’t. It just needs some way to encode all German characters.
I’ll have to investigate whether simply using UTF-8 encoding is an
option, though.
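If UTF-8 does turn out to be an option, the umlauts can simply appear literally in the file and no entity handling is needed. A small sketch of producing such data; UTF8DocumentData and the <word> element are made up for illustration:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: wraps text in a one-element document that declares
// UTF-8 and encodes it accordingly. Only the markup characters & and < are
// escaped; umlauts stay as literal characters.
static NSData *UTF8DocumentData(NSString *text) {
    NSString *escaped = [text stringByReplacingOccurrencesOfString:@"&"
                                                        withString:@"&amp;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@"<"
                                                 withString:@"&lt;"];
    NSString *xml = [NSString stringWithFormat:
        @"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<word>%@</word>", escaped];
    // In UTF-8, "ä" (U+00E4) becomes the two bytes 0xC3 0xA4.
    return [xml dataUsingEncoding:NSUTF8StringEncoding];
}
```

The resulting NSData can be passed straight to -[NSXMLParser initWithData:]; the parser reads the encoding from the XML declaration.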
--Nathan
Thanks a lot for your very helpful answer
Kai