Re: NSXML and invalid UTF8 characters
- Subject: Re: NSXML and invalid UTF8 characters
- From: Keith Blount <email@hidden>
- Date: Thu, 28 Jan 2010 15:43:43 -0800 (PST)
Thanks for the heads up. Actually, this came up recently: a user had pasted some Word characters into my app, including an invalid UTF-8 one, and it was throwing exceptions on loading the text storage. (It turned out that using -replaceCharactersInRange:withAttributedString: after init'ing the NSTextStorage, rather than using -initWithAttributedString:, fixed the exceptions, but I reported it to Apple, as it can obviously cause problems with the text system.)
I think the XML spec page I cited lists the actual valid character ranges, though, so as long as I can find a way to include only characters from within them, I should be fine.
I'm assuming the solution will be to cycle through all the characters in the string using -characterAtIndex:, checking that each one is within the valid ranges, but I'm not entirely sure of the best way of doing this, even though I'm sure it seems simple to those more grounded in C.
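As a rough illustration of that loop, here is a minimal sketch in plain C. It assumes the string's UTF-16 code units have already been copied into a buffer (for example with NSString's -getCharacters:range:), and it checks each one against the Char ranges from the XML 1.0 spec: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The function names xml_char_is_valid and xml_filter_utf16 are hypothetical, not part of any Apple API, and well-formed surrogate pairs are kept on the assumption that characters above U+FFFF should survive.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the code point is allowed by the XML 1.0 Char production. */
static int xml_char_is_valid(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Hypothetical helper: copies the XML-safe UTF-16 code units from src
 * into dst and returns how many were written. dst must have room for
 * len code units. */
size_t xml_filter_utf16(const uint16_t *src, size_t len, uint16_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t u = src[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: keep it only when a low surrogate follows.
             * A well-formed pair always encodes U+10000..U+10FFFF. */
            if (i + 1 < len && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
                dst[out++] = u;
                dst[out++] = src[++i];
            }
            continue;
        }
        /* Lone low surrogates, control characters, 0xFFFE and 0xFFFF
         * all fall outside the valid ranges and are dropped here. */
        if (xml_char_is_valid(u))
            dst[out++] = u;
    }
    return out;
}
```

On the Cocoa side, the filtered buffer could then be turned back into a string with +stringWithCharacters:length:, since unichar is a 16-bit unsigned type. Working on a buffer rather than calling -characterAtIndex: once per character is only a design choice for the sketch; the range checks are the same either way.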
Thanks again.
All the best,
Keith
--- On Thu, 1/28/10, Sixten Otto <email@hidden> wrote:
> From: Sixten Otto <email@hidden>
> Subject: Re: NSXML and invalid UTF8 characters
> To: "Keith Blount" <email@hidden>
> Cc: email@hidden
> Date: Thursday, January 28, 2010, 11:30 PM
> On Thu, Jan 28, 2010 at 6:16 PM, Keith Blount <email@hidden> wrote:
> > I am using the NSXML classes to generate and parse my own XML files.
> > Sometimes these files store strings of text that has been brought in
> > from other applications (for instance, there might be a plain text
> > representation of some text the user has pasted in from Word).
>
> For what it's worth, another common cause of problems with stuff pasted
> from Word (at least on the web), is Word docs that contain characters
> from the Windows-1252 character set that are invalid UTF-8 byte
> sequences. Most commonly, 0x80-0x9F, which is the range where
> Windows-1252 differs from ISO-Latin-1.
>
> So whatever solution you come up with to deal with the characters in
> 0x00-0x1F that XML specifically doesn't allow, you probably want to
> also account for bytes in ranges like 0x80-0xFF, which aren't valid
> UTF-8 on their own.
>
> http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
> http://en.wikipedia.org/wiki/Windows-1252
>
> Sixten
>
_______________________________________________
Cocoa-dev mailing list (email@hidden)