Re: NSXML and invalid UTF8 characters
- Subject: Re: NSXML and invalid UTF8 characters
- From: Keith Blount <email@hidden>
- Date: Thu, 28 Jan 2010 16:29:20 -0800 (PST)
As an update, I tried this, which seems to partially work:
- (NSString *)stringCleanedForXML // in an NSString category
{
    unichar character;
    NSInteger index, len = [self length];
    NSMutableString *cleanedString = [[NSMutableString alloc] init];
    for (index = 0; index < len; index++)
    {
        character = [self characterAtIndex:index];
        if (character == 0x9 ||
            character == 0xA ||
            character == 0xD ||
            (character >= 0x20 && character <= 0xD7FF) ||
            (character >= 0xE000 && character <= 0xFFFD) ||
            (character >= 0x10000 && character <= 0x10FFFF))
            [cleanedString appendFormat:@"%C", character];
    }
    return [cleanedString autorelease];
}
With this, my XML strings were saved in a form that no longer produced errors on loading, but this line:
(character >= 0x10000 && character <= 0x10FFFF)
throws up this compiler warning:
"Comparison is always false due (they mean "owing"... :) ) to limited range of data type."
But I got these ranges from the XML site:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
and based the above method on non-Cocoa code here:
http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html
Obviously it's down to my misunderstanding though. So my questions are now:
a) Why am I getting this error (i.e. what dunderheaded thing am I doing wrong)?
b) Does the above achieve what I wanted and create a string containing only the characters allowed by the XML docs cited above?
c) Is this the fastest way of doing it or is there a faster way?
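
A possible explanation for (a) and (b), offered as a sketch rather than a definitive answer: unichar is a 16-bit type, so it can never hold a value of 0x10000 or more, which is why the compiler reports the comparison as always false. -characterAtIndex: returns UTF-16 code units, and characters above U+FFFF arrive as a surrogate pair (one unit in 0xD800-0xDBFF followed by one in 0xDC00-0xDFFF). The method above falls through the gap between 0xD7FF and 0xE000 for both halves, so it would silently drop such characters. The plain C below shows one way to keep well-formed pairs while still dropping everything the Char production disallows. The function name xml_filter_utf16 and the use of uint16_t as a stand-in for unichar are assumptions made for this illustration, not Cocoa API.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies `len` UTF-16 code units from `src` to `dst`, skipping anything the
   XML 1.0 Char production disallows. `dst` must have room for `len` units.
   Returns the number of units written.
   Hypothetical helper for illustration; uint16_t stands in for unichar. */
size_t xml_filter_utf16(const uint16_t *src, size_t len, uint16_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t u = src[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: keep it only when a low surrogate follows.
               Every such pair encodes a code point in 0x10000-0x10FFFF. */
            if (i + 1 < len && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
                dst[out++] = u;
                dst[out++] = src[++i];
            }
        } else if (u == 0x9 || u == 0xA || u == 0xD ||
                   (u >= 0x20 && u <= 0xD7FF) ||
                   (u >= 0xE000 && u <= 0xFFFD)) {
            dst[out++] = u;
        }
        /* Anything else (other control characters, lone surrogates,
           0xFFFE and 0xFFFF) is dropped. */
    }
    return out;
}
```

On (c), one option would be to fetch the code units in a single call with -getCharacters:range:, run them through a loop like this, and build the result with +stringWithCharacters:length:, which avoids a format call per character. Whether that is measurably faster for a given document is something to profile rather than assume.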
Thanks again!
All the best,
Keith
--- Original e-mail ---
Hello,
I am using the NSXML classes to generate and parse my own XML files. Sometimes these files store strings of text that have been brought in from other applications (for instance, there might be a plain text representation of some text the user has pasted in from Word).
In some instances I am receiving errors in NSXMLDocument's -initWithContentsOfURLPreservingWhitespace:error:, causing it to return nil with errors such as "Char 0x0 out of allowed range" or "PCDATA invalid char value 12". As I understand it, this is because XML doesn't allow certain ranges of Unicode characters:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
Especially:
Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Certainly, the "PCDATA invalid char" error was caused by an NSFormFeedCharacter. I don't know what the "Char 0x0" character is, but it's bound to be one from a Word document that isn't allowed.
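
To make the rule concrete, here is a minimal sketch of the Char production quoted above as a C predicate over full Unicode code points. The name xml_char_is_valid is made up for this illustration. Under this rule both values from the error messages are rejected: 0x0, and 12 (0xC, the form feed).

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the Unicode code point `cp` matches the XML 1.0 Char
   production. Hypothetical helper name; it takes a 32-bit code point,
   not a 16-bit UTF-16 code unit. */
bool xml_char_is_valid(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20    && cp <= 0xD7FF) ||
           (cp >= 0xE000  && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

The 32-bit parameter matters: the last range can only ever match if the value being tested is wide enough to hold 0x10000 and above.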
So, my question is, what is the best way for me to filter out these invalid characters from my NSString before I pass it into NSXMLElement's -initWithName:stringValue: or similar methods, to avoid creating XML documents that won't open?
This page seems useful:
http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html
It would seem to indicate that I would need to write some C code to build up a string without the invalid characters and then turn it into an NSString, but I was wondering whether there are any methods built into the AppKit that already strip these invalid XML characters. I have looked but couldn't see any. If not, I would be very grateful if anyone could give me any pointers on using the above info to create a method that would do this. I'm self-taught, so all my knowledge is high-level Cocoa and Objective-C, and I'd end up doing it all with methods such as -appendString: and -stringWithFormat:, which I know would be the wrong approach here, as it would be too slow and the job really requires C.
Many thanks in advance for any help anyone can give.
All the best,
Keith