Re: NSXML and invalid UTF8 characters
- Subject: Re: NSXML and invalid UTF8 characters
- From: Keith Blount <email@hidden>
- Date: Fri, 29 Jan 2010 04:00:58 -0800 (PST)
Hi Jens,
Many thanks again for the help. Sorry I wasn't clearer about what I meant when I said "invalid UTF8" - I was using it in the context of XML, but should have made that more explicit. Also, I owe you an apology as I had completely missed NSMutableCharacterSet's -addCharactersInRange:, which as you say is exactly what I needed; I should have re-checked the NSCharacterSet methods after your first reply.
So, I hope I have a solution. I use NSMutableCharacterSet to build a character set containing every Unicode character that is valid in XML, invert it to get the set of invalid characters, then search the string for those invalid characters and delete them. My NSString category method is below:
- (NSString *)validXMLString
{
    // Not all Unicode characters are valid in XML.
    // See:
    // http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
    // (Also see: http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html )
    //
    // The ranges of Unicode characters allowed, as specified above, are:
    // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
    //
    // To ensure the string is valid for XML encoding, we therefore need to remove any characters that
    // do not fall within the above ranges.

    // First create a character set containing all invalid XML characters.
    // Create this once and leave it in memory so that we can reuse it rather
    // than recreate it every time we need it.
    static NSCharacterSet *invalidXMLCharacterSet = nil;
    if (invalidXMLCharacterSet == nil)
    {
        // First, create a character set containing all valid XML characters.
        // NSMakeRange() takes (location, length), so an inclusive range
        // [first-last] has length (last - first + 1).
        NSMutableCharacterSet *XMLCharacterSet = [[NSMutableCharacterSet alloc] init];
        [XMLCharacterSet addCharactersInRange:NSMakeRange(0x9, 1)];
        [XMLCharacterSet addCharactersInRange:NSMakeRange(0xA, 1)];
        [XMLCharacterSet addCharactersInRange:NSMakeRange(0xD, 1)];
        [XMLCharacterSet addCharactersInRange:NSMakeRange(0x20, 0xD7FF - 0x20 + 1)];
        [XMLCharacterSet addCharactersInRange:NSMakeRange(0xE000, 0xFFFD - 0xE000 + 1)];
        [XMLCharacterSet addCharactersInRange:NSMakeRange(0x10000, 0x10FFFF - 0x10000 + 1)];

        // Then create and retain an inverted set, which will thus contain all invalid XML characters.
        invalidXMLCharacterSet = [[XMLCharacterSet invertedSet] retain];
        [XMLCharacterSet release];
    }

    // Are there any invalid characters in this string?
    NSRange range = [self rangeOfCharacterFromSet:invalidXMLCharacterSet];

    // If not, just return self unaltered.
    if (range.length == 0)
        return self;

    // Otherwise go through and remove any illegal XML characters from a copy of the string.
    NSMutableString *cleanedString = [self mutableCopy];
    while (range.length > 0)
    {
        [cleanedString deleteCharactersInRange:range];

        // Everything before range.location is already known to be clean,
        // so resume the search from there rather than from the start.
        NSRange remaining = NSMakeRange(range.location, [cleanedString length] - range.location);
        range = [cleanedString rangeOfCharacterFromSet:invalidXMLCharacterSet options:0 range:remaining];
    }

    return (NSString *)[cleanedString autorelease];
}
As the invalid character set is only created once, and as nothing is done if the string has no invalid XML characters, this seems to run pretty fast and do what I need.
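
For anyone who wants to sanity-check the ranges outside Cocoa, here is a minimal sketch of the same Char test in plain C (which also compiles as Objective-C). It is only an illustration, not what the category method does internally: it assumes the text has already been decoded into an array of Unicode code points (UTF-32), which is not how NSString stores its contents, and the names xml_char_is_valid and xml_strip_invalid are hypothetical helpers made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if the code point cp is allowed by the
   XML 1.0 Char production quoted above. Both ends of each range are
   inclusive. */
static bool xml_char_is_valid(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Hypothetical helper: removes every invalid code point from buf in place,
   in a single pass, and returns the new length. Assumes buf holds decoded
   code points (UTF-32), not UTF-8 bytes or UTF-16 units. */
static size_t xml_strip_invalid(uint32_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (xml_char_is_valid(buf[i]))
            buf[out++] = buf[i]; /* keep valid code points, compacting as we go */
    }
    return out;
}
```

Feeding it the boundary values (0x1F, 0x20, 0xD7FF, 0xD800, 0xFFFD, 0xFFFE, 0x10FFFF) is a quick way to check that each range in the character set starts and ends where the spec says it should.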
Many thanks again!
All the best,
Keith
--- On Fri, 1/29/10, Jens Alfke <email@hidden> wrote:
> From: Jens Alfke <email@hidden>
> Subject: Re: NSXML and invalid UTF8 characters
> To: "Keith Blount" <email@hidden>
> Cc: email@hidden
> Date: Friday, January 29, 2010, 3:23 AM
>
> On Jan 28, 2010, at 3:47 PM, Keith Blount wrote:
>
> > Many thanks for your reply. Wouldn't using these
> > methods be a lot more expensive (and slower) than going
> > through using -characterAtIndex: or something similar,
> > accessing the characters directly, though?
>
> No, because it's more efficient to let NSString itself do
> the searching, avoiding the overhead of a message-send per
> character.
>
> > I'm thinking that I would have to add every character
> > to the character set and then let NSString deal with all the
> > underlying character stuff this way, whereas if I could
> > check the unicode char is within a range then it would be
> > faster.
>
> You can easily create an NSCharacterSet on any range of
> Unicode values.
>
> BTW, it's inaccurate to say "invalid UTF-8". UTF-8 is just
> an encoding of Unicode. You're talking about Unicode
> characters that are illegal in XML. (I bring this up because
> there is such a thing as invalid UTF-8, i.e. byte sequences
> that are invalid in UTF-8 encoding, but it's an entirely
> different issue; this confused me when I first read your
> message.)
>
> —Jens
>
>
_______________________________________________
Cocoa-dev mailing list (email@hidden)