Re: CFXMLCreateStringByUnescapingEntities() bombs on "&#13207494"
- Subject: Re: CFXMLCreateStringByUnescapingEntities() bombs on "&#13207494"
- From: Quincey Morris <email@hidden>
- Date: Tue, 25 Mar 2014 10:49:26 -0700
On Mar 25, 2014, at 10:04 , Jerry Krinock <email@hidden> wrote:
> // Examine the result
> NSLog(@"bomb2 length=%ld", (long)[bomb2 length]) ;
> unichar char0 = [bomb2 characterAtIndex:0] ;
> NSLog(@"char0 = '%c' = %x = %d", char0, char0, char0) ;
> unichar char1 = [bomb2 characterAtIndex:1] ;
> NSLog(@"char1 = '%c' = %x = %d", char1, char1, char1) ;
> NSLog(@"bomb2 = '%@' THIS DOES NOT LOG AT ALL!!!", bomb2) ;
> printf("printf bomb2: %s\n", [bomb2 UTF8String]) ;
>
> Here is the result:
>
> TestApp[13859:303] bomb1 length=10
> TestApp[13859:303] bomb1 = '&#13207494'
> TestApp[13859:303] bomb2 length=2
> TestApp[13859:303] char0 = 'É' = dcc9 = 56521
> TestApp[13859:303] char1 = '-' = df2d = 57133
> printf bomb2: (null)
>
> I don’t see why CFXMLCreateStringByUnescapingEntities() is even touching bomb1, because it does not end in a semicolon. There is no HTML entity in bomb1.
>
> The two characters in bomb2, U+DCC9 and U+DF2D, are unassigned characters in the “Low Surrogates” block. Changing the number “13207494” to a slightly different value sometimes cures the problem.
You’ve got this slightly wrong. The 16-bit “characters” in an NSString aren’t Unicode characters (that is, code points). Rather, they’re UTF-16 code units. In some cases (specifically, for code units in the surrogate range D800 to DFFF), it takes two of these to represent one code point. Thus, in your example, it makes no sense to try to display the code units separately as characters.
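To make the code unit vs. code point distinction concrete, here is a minimal sketch in plain C (so it also compiles inside Objective-C). The function names is_high_surrogate, is_low_surrogate and utf16_encode are hypothetical helpers made up for illustration. They are not Foundation or Core Foundation API, and this is not a claim about how CFXMLCreateStringByUnescapingEntities works internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for a high (leading) surrogate, D800..DBFF. */
bool is_high_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

/* Hypothetical helper: true for a low (trailing) surrogate, DC00..DFFF. */
bool is_low_surrogate(uint16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

/*
 * Hypothetical helper: encode one code point as UTF-16.
 * Returns the number of code units written to out (1 or 2),
 * or 0 if cp is above U+10FFFF or is itself a surrogate.
 */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;                       /* not encodable */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* one code unit is enough */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
    return 2;
}
```

With these definitions, U+1F600 encodes as the pair D83D DE00, while 13207494 (0xC987C6) is rejected because it is above U+10FFFF. Both code units in the log above, DCC9 and DF2D, satisfy is_low_surrogate, so together they do not form a valid pair: a pair needs a high surrogate first.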
> This seems to me like a bug in CFXMLCreateStringByUnescapingEntities(), and that the proper workaround would be to pre-flight its input value (bomb1) and take evasive action if necessary.
I agree this is probably a bug in CFXMLCreateStringByUnescapingEntities. It seems to have assumed a missing ‘;’ at the end of an otherwise valid escaped character entity. It probably shouldn’t make this assumption.
However, I also see this as a bug in your code, since you’re accepting “random” user input as formatted text (i.e. escaped HTML) without validation. That sort of assumption makes you prone to exploding bugs like your Core Data crash. It’s similar to buffer overflow bugs, in that not only can it cause crashes but also it can compromise system security.
Not every 32-bit value is a valid Unicode code point. Therefore, I don’t think it’s a *workaround* to validate your input. Since you have provided a technique that can enter any 32-bit value, it’s a necessary step.
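As a rough sketch of that kind of pre-flight check, something along these lines could be run before the text is handed to the unescaping function. The name parse_decimal_char_ref is hypothetical. It handles only the decimal "&#NNNN;" form, and it assumes the caller has already located the '&'. XML's own rules exclude more than this (most control characters, for example), so treat it as a starting point rather than a complete validator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: parse a decimal character reference of the
 * form "&#DIGITS;" at the start of s. On success, stores the code
 * point in *cp and returns true. Returns false when:
 *   - s does not begin with "&#",
 *   - there are no digits,
 *   - the terminating ';' is missing,
 *   - the value is above U+10FFFF or in the surrogate range.
 */
bool parse_decimal_char_ref(const char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);

    if (s[0] != '&' || s[1] != '#') {
        return false;
    }

    const char *p = s + 2;
    uint32_t value = 0;
    size_t digits = 0;

    while (*p >= '0' && *p <= '9') {
        value = value * 10 + (uint32_t)(*p - '0');
        if (value > 0x10FFFF) {
            return false;   /* too large; bailing here also avoids overflow */
        }
        p++;
        digits++;
    }

    if (digits == 0 || *p != ';') {
        return false;       /* empty or unterminated reference */
    }
    if (value >= 0xD800 && value <= 0xDFFF) {
        return false;       /* surrogates are not characters */
    }

    *cp = value;
    return true;
}
```

Under these rules the string from this thread, "&#13207494", is rejected twice over: the value exceeds U+10FFFF and the ';' is missing. What to do with rejected input (leave it as literal text, strip it, or report an error) is a separate decision for the application.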