Re: NSXML and invalid UTF8 characters
- Subject: Re: NSXML and invalid UTF8 characters
- From: Sixten Otto <email@hidden>
- Date: Thu, 28 Jan 2010 18:30:55 -0500
On Thu, Jan 28, 2010 at 6:16 PM, Keith Blount <email@hidden> wrote:
> I am using the NSXML classes to generate and parse my own XML files. Sometimes these files store strings of text that have been brought in from other applications (for instance, there might be a plain text representation of some text the user has pasted in from Word).
For what it's worth, another common cause of problems with text
pasted from Word (at least on the web) is documents containing
Windows-1252 characters whose bytes are not valid UTF-8 sequences.
The usual culprits are 0x80-0x9F, the range where Windows-1252
differs from ISO-Latin-1 (curly quotes, dashes, and the like).
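
One way to cope with such bytes, sketched in Swift, is to try UTF-8
first and fall back to Windows-1252 when decoding fails. The helper
name decodeText is hypothetical, and falling back to Windows-1252 is
an assumption about where the bytes came from; nothing here is
specific to NSXML.

```swift
import Foundation

// Hypothetical helper: decode raw bytes as UTF-8, and if they are not
// well-formed UTF-8, assume they are Windows-1252 instead.
func decodeText(_ data: Data) -> String? {
    if let utf8 = String(data: data, encoding: .utf8) {
        return utf8
    }
    // Assumption: malformed UTF-8 from a Word paste is really Windows-1252.
    return String(data: data, encoding: .windowsCP1252)
}

// 0x93 and 0x94 are curly double quotes in Windows-1252, but on their
// own they are stray continuation bytes in UTF-8.
let pasted = Data([0x93, 0x48, 0x69, 0x94])
if let text = decodeText(pasted) {
    print(text)
}
```

The fallback is a guess: bytes that fail UTF-8 decoding could be in
some other legacy encoding, so this is only reasonable when the source
is known to be Windows-flavored text.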
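So whatever solution you come up with to deal with the control
characters in 0x00-0x1F that XML doesn't allow (everything in that
range except tab, LF, and CR), you probably want to also account for
bytes in 0x80-0xFF that don't form valid UTF-8 sequences: a byte in
that range is never valid on its own, only as part of a well-formed
multi-byte sequence.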
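
For the control-character half of the problem, a minimal Swift sketch
that drops every code point outside the XML 1.0 Char production
before the string is handed to the XML classes. The function names
are hypothetical, and silently removing characters (rather than
replacing or escaping them) is just one possible policy.

```swift
// XML 1.0 allows: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
// | [#x10000-#x10FFFF]. Everything else is rejected here.
func isAllowedInXML10(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.value {
    case 0x9, 0xA, 0xD,
         0x20...0xD7FF,
         0xE000...0xFFFD,
         0x10000...0x10FFFF:
        return true
    default:
        return false
    }
}

// Hypothetical helper: return a copy of the string with the
// disallowed code points removed.
func strippingXMLInvalidCharacters(_ string: String) -> String {
    var scalars = String.UnicodeScalarView()
    for scalar in string.unicodeScalars where isAllowedInXML10(scalar) {
        scalars.append(scalar)
    }
    return String(scalars)
}

// NUL and 0x1F are removed; tab and newline are kept.
let cleaned = strippingXMLInvalidCharacters("a\u{0}b\u{1F}c\tD\n")
print(cleaned == "abc\tD\n")
```

This works on decoded Unicode scalars, so it assumes the bytes have
already been turned into a valid string; it does nothing about
malformed UTF-8 in the raw data.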
http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
http://en.wikipedia.org/wiki/Windows-1252
Sixten