On Aug 29, 2009, at 3:48 PM, Ross Carter wrote:

On Aug 29, 2009, at 1:22 PM, Ken Thomases wrote:

On Aug 29, 2009, at 11:46 AM, Ross Carter wrote:

Suppose an NSAttributedString comprises the string o + umlaut in decomposed form, plus one attribute. Its length is 2, and the range of the attribute is {0, 2}. The string and its attribute are archived separately as XML data like this:
<string>ö</string>
<attrName>NSFontAttributeName</attrName>
<attrValue location='0' length='2'>Helvetica 12.0</attrValue>

If, during unarchiving, the string is represented by an NSString object in precomposed form, its length will be 1, and an attempt to apply the attribute range of {0, 2} will fail with a range exception.
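
A minimal sketch of the failure (variable names are illustrative):

#import <Foundation/Foundation.h>
#import <AppKit/AppKit.h>

int main(void) {
    @autoreleasepool {
        // o followed by COMBINING DIAERESIS (U+0308): the decomposed form of ö.
        NSString *decomposed = @"o\u0308";
        NSString *precomposed = [decomposed precomposedStringWithCanonicalMapping];
        NSLog(@"decomposed length %lu, precomposed length %lu",
              (unsigned long)[decomposed length],
              (unsigned long)[precomposed length]);   // prints 2 and 1

        // Applying the archived range {0, 2} to the precomposed string
        // raises NSRangeException: the string is only 1 unit long.
        NSMutableAttributedString *s =
            [[NSMutableAttributedString alloc] initWithString:precomposed];
        [s addAttribute:NSFontAttributeName
                  value:[NSFont fontWithName:@"Helvetica" size:12.0]
                  range:NSMakeRange(0, 2)];
    }
    return 0;
}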

But why would it change between archiving and unarchiving?

Because during unarchiving, the NSString is created by NSXMLParser, and I assume that there is no guarantee regarding the normalization form of that string. NSXMLParser might decompose the string, for example. It seems to me that to rely on NSXMLParser always to return strings in a particular form is to rely on an implementation detail.

You can't rely on it to always return strings in a particular form. You should be able to rely on it to return strings in the form in which they were written.

Admittedly I have not observed any such funny business. I just assume it is possible.

I do not. If an XML library/framework were to fail to maintain the round-trip integrity of my data, I would consider that a bug.
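
One way to sidestep the question entirely is to pin the normalization form down yourself on both sides of the round trip, so the archived ranges stay valid no matter what the parser does. A sketch, with a hypothetical helper:

#import <Foundation/Foundation.h>

// Hypothetical helper: normalize to one canonical form both before
// computing attribute ranges at archive time and after parsing at
// unarchive time. If the parser preserved the form, this is a no-op.
static NSString *CanonicalForm(NSString *s) {
    return [s precomposedStringWithCanonicalMapping];
}

int main(void) {
    @autoreleasepool {
        NSString *original = CanonicalForm(@"o\u0308");    // length 1
        // ...archive `original` plus ranges computed against it...
        NSString *parsed = @"o\u0308"; // worst case: the parser decomposed it
        NSString *restored = CanonicalForm(parsed);        // length 1 again
        NSLog(@"lengths match, archived ranges still valid: %d",
              [restored length] == [original length]);
    }
    return 0;
}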

Apple's NSXML documentation (which, admittedly, doesn't quite apply to NSXMLParser) references <http://www.w3.org/TR/xmlschema-2/>, which defines an XML string data type with this definition:

The string datatype represents character strings in XML. The ·value space· of string is the set of finite-length sequences of characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the Char production from [XML 1.0 (Second Edition)]. A character is an atomic unit of communication; it is not further specified except to note that every character has a corresponding Universal Character Set code point, which is an integer.

To me, this definition prohibits an XML parser from considering a string as anything other than a sequence of characters. That is, it can't apply knowledge about Unicode canonical equivalence or decomposition, etc. You put in a sequence of characters, you get out that sequence of characters. (The schema also defines a normalizedString data type, but that uses a completely different sense of normalization than we're discussing.)
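
Rather than assuming either behavior, the round trip is also easy to test directly. A sketch using NSXMLParser (the delegate class is made up for the test):

#import <Foundation/Foundation.h>

// Accumulates character data from the parser so it can be compared,
// code unit for code unit, against what was written.
@interface RoundTripChecker : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *found;
@end

@implementation RoundTripChecker
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    if (!self.found) self.found = [NSMutableString string];
    [self.found appendString:string];
}
@end

int main(void) {
    @autoreleasepool {
        NSString *decomposed = @"o\u0308"; // length 2
        NSString *xml = [NSString stringWithFormat:@"<s>%@</s>", decomposed];
        NSData *data = [xml dataUsingEncoding:NSUTF8StringEncoding];

        RoundTripChecker *checker = [[RoundTripChecker alloc] init];
        NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
        parser.delegate = checker;
        [parser parse];

        // isEqualToString: compares code units, not canonical equivalence,
        // so this reports whether the exact sequence survived the parse.
        NSLog(@"round trip preserved the sequence: %d",
              [checker.found isEqualToString:decomposed]);
    }
    return 0;
}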

Regards,
Ken
