Re: NSXML and invalid UTF8 characters

  • Subject: Re: NSXML and invalid UTF8 characters
  • From: Keith Blount <email@hidden>
  • Date: Fri, 29 Jan 2010 04:00:58 -0800 (PST)

Hi Jens,

Many thanks again for the help. Sorry I wasn't clearer about what I meant when I said "invalid UTF8" - I was using it in the context of XML, but should have made that more explicit. Also, I owe you an apology as I had completely missed NSMutableCharacterSet's -addCharactersInRange:, which as you say is exactly what I needed; I should have re-checked the NSCharacterSet methods after your first reply.

So, I hope I have a solution. I use NSMutableCharacterSet to create a character set containing all Unicode characters that are valid in XML, then invert it to get the set of invalid characters, then search the string for those invalid characters and delete them. My NSString category method is below:

- (NSString *)validXMLString
{
	// Not all Unicode characters are valid in XML.
	// See:
	// http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
	// (Also see: http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html )
	//
	// The ranges of Unicode characters allowed, as specified above, are:
	// Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
	//
	// To ensure the string is valid for XML encoding, we therefore need to remove any characters that
	// do not fall within the above ranges.

	// First create a character set containing all invalid XML characters.
	// Create this once and leave it in memory so that we can reuse it rather
	// than recreate it every time we need it.
	static NSCharacterSet *invalidXMLCharacterSet = nil;

	if (invalidXMLCharacterSet == nil)
	{
		// First, create a character set containing all valid XML characters.
		NSMutableCharacterSet *XMLCharacterSet = [[NSMutableCharacterSet alloc] init];
		[XMLCharacterSet addCharactersInRange:NSMakeRange(0x9, 1)];
		[XMLCharacterSet addCharactersInRange:NSMakeRange(0xA, 1)];
		[XMLCharacterSet addCharactersInRange:NSMakeRange(0xD, 1)];
		// NSMakeRange takes (location, length), so each inclusive range needs last - first + 1.
		[XMLCharacterSet addCharactersInRange:NSMakeRange(0x20, 0xD7FF - 0x20 + 1)];
		[XMLCharacterSet addCharactersInRange:NSMakeRange(0xE000, 0xFFFD - 0xE000 + 1)];
		[XMLCharacterSet addCharactersInRange:NSMakeRange(0x10000, 0x10FFFF - 0x10000 + 1)];

		// Then create and retain an inverted set, which will thus contain all invalid XML characters.
		invalidXMLCharacterSet = [[XMLCharacterSet invertedSet] retain];
		[XMLCharacterSet release];
	}

	// Are there any invalid characters in this string?
	NSRange range = [self rangeOfCharacterFromSet:invalidXMLCharacterSet];

	// If not, just return self unaltered.
	if (range.length == 0)
		return self;

	// Otherwise go through and remove any illegal XML characters from a copy of the string.
	NSMutableString *cleanedString = [self mutableCopy];

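	while (range.length > 0)
	{
		[cleanedString deleteCharactersInRange:range];

		// Resume the search where the deletion happened rather than rescanning from the start.
		NSRange remainingRange = NSMakeRange(range.location, [cleanedString length] - range.location);
		range = [cleanedString rangeOfCharacterFromSet:invalidXMLCharacterSet options:0 range:remainingRange];
	}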

	return (NSString *)[cleanedString autorelease];
}

As the invalid character set is only created once, and as nothing is done if the string has no invalid XML characters, this seems to run pretty fast and do what I need.
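
For reference, the Char production quoted in the comment above can also be checked one code point at a time. The following is only a minimal sketch in plain C (which also compiles as Objective-C), not part of the category method; the names xml_char_is_valid and inclusive_length are hypothetical, chosen for illustration. The second helper spells out the length that an NSMakeRange(first, length) call needs in order to include both end points of a range.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the code point matches the XML 1.0 Char production:
   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
   (hypothetical helper name, for illustration only). */
static bool xml_char_is_valid(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Number of code points in the inclusive range [first, last].
   This is the length to pass alongside `first` when building a
   (location, length) range that must still contain `last`. */
static uint32_t inclusive_length(uint32_t first, uint32_t last)
{
    return last - first + 1;
}
```

A per-code-point check like this is mainly useful for spot-checking the boundaries of the character set (0xD7FF, 0xFFFD and 0x10FFFF are valid; 0xD800, 0xFFFE and 0x110000 are not).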

Many thanks again!
All the best,
Keith

--- On Fri, 1/29/10, Jens Alfke <email@hidden> wrote:

> From: Jens Alfke <email@hidden>
> Subject: Re: NSXML and invalid UTF8 characters
> To: "Keith Blount" <email@hidden>
> Cc: email@hidden
> Date: Friday, January 29, 2010, 3:23 AM
>
> On Jan 28, 2010, at 3:47 PM, Keith Blount wrote:
>
> > Many thanks for your reply. Wouldn't using these methods be a lot
> > more expensive (and slower) than going through using
> > -characterAtIndex: or something similar, accessing the characters
> > directly, though?
>
> No, because it's more efficient to let NSString itself do
> the searching, avoiding the overhead of a message-send per
> character.
>
> > I'm thinking that I would have to add every character to the
> > character set and then let NSString deal with all the underlying
> > character stuff this way, whereas if I could check the unicode char
> > is within a range then it would be faster.
>
> You can easily create an NSCharacterSet on any range of
> Unicode values.
>
> BTW, it's inaccurate to say "invalid UTF-8". UTF-8 is just
> an encoding of Unicode. You're talking about Unicode
> characters that are illegal in XML. (I bring this up because
> there is such a thing as invalid UTF-8, i.e. byte sequences
> that are invalid in UTF-8 encoding, but it's an entirely
> different issue; this confused me when I first read your
> message.)
>
> —Jens
>
>
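
To make the distinction in the quoted message concrete: "invalid UTF-8" is a property of a byte sequence, whereas "illegal in XML" is a property of a decoded Unicode character. The sketch below is a minimal well-formedness check in plain C, written under the usual UTF-8 rules (no overlong forms, no encoded surrogates, nothing above U+10FFFF); the function name is hypothetical and this is not how NSString or NSXML perform the check internally.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if buf[0..len) is well-formed UTF-8.
   Hypothetical helper, for illustration only. */
static bool is_well_formed_utf8(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t lead = buf[i];
        size_t extra;      /* number of continuation bytes expected */
        uint32_t cp;       /* code point being decoded */
        uint32_t min;      /* smallest code point allowed for this length */

        if (lead < 0x80)                { extra = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;              /* stray continuation or invalid lead byte */

        if (extra > len - i - 1)
            return false;               /* sequence is truncated */

        for (size_t k = 1; k <= extra; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return false;           /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return false;                      /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* encoded surrogate */
        if (cp > 0x10FFFF) return false;                 /* beyond Unicode */

        i += extra + 1;
    }
    return true;
}
```

With this check, the single byte 0x01 is perfectly well-formed UTF-8 even though U+0001 is not allowed in XML 1.0, while the two bytes 0xC0 0x80 are invalid UTF-8 regardless of what XML permits.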



_______________________________________________

Cocoa-dev mailing list (email@hidden)
