Re: Normalize an NSAttributedString
- Subject: Re: Normalize an NSAttributedString
- From: Ken Thomases <email@hidden>
- Date: Sat, 29 Aug 2009 16:29:44 -0500
On Aug 29, 2009, at 3:48 PM, Ross Carter wrote:
> On Aug 29, 2009, at 1:22 PM, Ken Thomases wrote:
>> On Aug 29, 2009, at 11:46 AM, Ross Carter wrote:
>>> Suppose an NSAttributedString comprises the string o + umlaut in
>>> decomposed form, plus one attribute. Its length is 2, and the
>>> range of the attribute is {0, 2}. The string and its attribute are
>>> archived separately as XML data like this:
>>>
>>> <string>ö</string>
>>> <attrName>NSFontAttributeName</attrName>
>>> <attrValue location='0' length='2'>Helvetica 12.0</attrValue>
>>>
>>> If, during unarchiving, the string is represented by an NSString
>>> object in precomposed form, its length will be 1, and an attempt
>>> to apply the attribute range of {0, 2} will fail.
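For concreteness, here is a sketch of that mismatch (my own
illustration, not code from either message; the font attribute stands
in for whatever was archived):

#import <Cocoa/Cocoa.h>

int main(void)
{
    @autoreleasepool {
        // Decomposed: o + COMBINING DIAERESIS (U+0308), two UTF-16 units.
        NSString *decomposed = @"o\u0308";
        // Precomposed: U+00F6 LATIN SMALL LETTER O WITH DIAERESIS, one unit.
        NSString *precomposed = @"\u00F6";

        NSLog(@"lengths: %lu vs. %lu",
              (unsigned long)decomposed.length,     // 2
              (unsigned long)precomposed.length);   // 1
        // compare: honors canonical equivalence; isEqualToString: is literal.
        NSLog(@"canonically equal: %d, literally equal: %d",
              [decomposed compare:precomposed] == NSOrderedSame,   // 1
              [decomposed isEqualToString:precomposed]);           // 0

        // Applying the archived range {0, 2} to the precomposed string
        // raises, because that string's length is only 1.
        NSMutableAttributedString *s =
            [[NSMutableAttributedString alloc] initWithString:precomposed];
        @try {
            [s addAttribute:NSFontAttributeName
                      value:[NSFont fontWithName:@"Helvetica" size:12.0]
                      range:NSMakeRange(0, 2)];
        }
        @catch (NSException *e) {
            NSLog(@"%@", e.name);   // NSRangeException
        }
    }
    return 0;
}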
>> But why would it change between archiving and unarchiving?
> Because during unarchiving, the NSString is created by NSXMLParser,
> and I assume that there is no guarantee regarding the normalization
> form of that string. NSXMLParser might decompose the string, for
> example. It seems to me that to rely on NSXMLParser always to
> return strings in a particular form is to rely on an implementation
> detail.
You can't rely on it to always return strings in a particular form.
You should be able to rely on it to return strings in the form in
which they were written.
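If you did want to guard against renormalization rather than trust
the parser, one approach (a sketch of mine, not anything the
frameworks promise) is to pick a single canonical form on both sides
of the archive boundary, so the archived ranges always index the same
UTF-16 units:

#import <Cocoa/Cocoa.h>

// Hypothetical helper: normalize the parsed string to precomposed form
// before applying an archived range. This only helps if the archiver
// also computed its ranges against the precomposed form.
static NSAttributedString *AttributedStringFromArchive(NSString *parsedString,
                                                       NSRange fontRange)
{
    NSString *canonical = [parsedString precomposedStringWithCanonicalMapping];
    NSMutableAttributedString *result =
        [[NSMutableAttributedString alloc] initWithString:canonical];
    [result addAttribute:NSFontAttributeName
                   value:[NSFont fontWithName:@"Helvetica" size:12.0]
                   range:fontRange];
    return result;
}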
> Admittedly I have not observed any such funny business. I just
> assume it is possible.
I do not. If an XML library/framework were to fail to maintain the
round-trip integrity of my data, I would consider that a bug.
Apple's NSXML documentation (which, admittedly, doesn't quite apply to
NSXMLParser) references <http://www.w3.org/TR/xmlschema-2/>, which
defines an XML string data type, with this definition:
The string datatype represents character strings in XML. The ·value
space· of string is the set of finite-length sequences of characters
(as defined in [XML 1.0 (Second Edition)]) that ·match· the Char
production from [XML 1.0 (Second Edition)]. A character is an atomic
unit of communication; it is not further specified except to note
that every character has a corresponding Universal Character Set
code point, which is an integer.
To me, this definition prohibits an XML parser from considering a
string as anything other than a sequence of characters. That is, it
can't apply knowledge about Unicode canonical equivalence or
decomposition, etc. You put in a sequence of characters, you get out
that sequence of characters. (The schema also defines a
normalizedString data type, but that uses a completely different sense
of normalization than we're discussing.)
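That round-trip property is also easy to test for a particular
parser. Here is a sketch of mine (names invented) that feeds a
decomposed string through NSXMLParser and checks that the characters
come back code-unit for code-unit identical:

#import <Foundation/Foundation.h>

@interface TextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation TextCollector
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string
{
    if (!self.text) self.text = [NSMutableString string];
    [self.text appendString:string];
}
@end

int main(void)
{
    @autoreleasepool {
        NSString *decomposed = @"o\u0308";   // o + U+0308, length 2
        NSString *xml = [NSString stringWithFormat:@"<string>%@</string>",
                                                   decomposed];
        NSXMLParser *parser = [[NSXMLParser alloc]
            initWithData:[xml dataUsingEncoding:NSUTF8StringEncoding]];
        TextCollector *collector = [TextCollector new];
        parser.delegate = collector;
        [parser parse];

        // Literal (code-unit) equality, not just canonical equivalence.
        NSLog(@"round trip intact: %d",
              [collector.text isEqualToString:decomposed]);   // expect 1
    }
    return 0;
}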
Regards,
Ken