Re: [LANG] Wanted - spec lawyer.

John Bollinger Tue, 30 Jun 2009 07:17:12 -0700

Jörg Schaible wrote:
> As pointed out http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets and
> http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets define the valid
> characters for XML 1.0 and 1.1.
> 
> However, the escape functionality is actually different. If you transport
> XML (or HTML) in a UTF-8 encoded text file or one encoded by ASCII-7 is a
> big difference. In the former you don't have to encode anything, while you
> have to encode anything above 0x7f in the latter case. And this applies to
> XML, HTML or Java source files at equal level.
> 
> The character set definition of the two XML versions is a vertical condition
> set. An attempt to encode a character outside the XML definition is
> actually a situation that cannot be handled and should raise an exception
> (like every XML parser will do anyway).
> 
> Therefore the question is, whether (Un)EscapeUtils should actually be an
> instance initialized with the target character encoding. And that raises
> the question how close we're actually at reimplementing
> java.nio.Charset.encode.

As I understand it, the basic idea of StringEscapeUtils.escapeXml() is to 
convert arbitrary character data from memory (a String) into a character 
sequence that has the same meaning when it appears literally in XML character 
data.  This is a conversion from character data to character data, so character 
encoding is not directly relevant for this use (and this is a fundamental 
difference from Charset.encode()).  The characters that must be escaped for 
this purpose are well defined by the XML specifications.
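To make that first concern concrete, here is a minimal sketch of a pure character-data-to-character-data escape. The class and method names are hypothetical and this is not the actual StringEscapeUtils implementation; it only replaces the five characters that have predefined entities in XML, and it never looks at a character encoding.

```java
// Hypothetical sketch, not the Commons Lang implementation.
final class XmlMarkupEscapeSketch {

    /**
     * Replaces the markup-significant characters with the predefined XML
     * entities. Every other character, including non-ASCII ones, is copied
     * through unchanged, because no byte encoding is involved at this level.
     */
    static String escapeMarkup(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, escapeMarkup("a < b & c") yields "a &lt; b &amp; c", and a string such as "Jörg" comes back untouched.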

The appearance of an encoding attribute in the XML declaration
notwithstanding, the character encoding of an XML document is a
property of a representation of the document, not a property of the
document itself.  There is therefore a *separate*, albeit related, 
consideration of escaping characters that cannot be expressed in a particular 
character encoding, so as to be able to encode the document to a byte sequence 
without data loss. This is a useful thing to do, and it is compatible with the 
main objective, but I think it would be well to avoid conflating the two as an 
indivisible task.  They can be performed in one pass by one method, but they 
are logically distinct behaviors.

If StringEscapeUtils wants to support the second use, then it needs a way for 
the user to tell it which additional characters to escape.  One possibility 
would be to pass it a Charset which the user intends to apply (later) to encode 
the characters.  StringEscapeUtils could then escape those input characters for 
which the Charset's encoder (CharsetEncoder.canEncode()) returns false.
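A rough sketch of that second, separate step might look like the following. The class and method names are hypothetical, and this is only one way it could be done, not a proposal for the final API. Note that the per-character test lives on CharsetEncoder (obtained from Charset.newEncoder()), since Charset.canEncode() itself takes no argument. The sketch walks the input by code point so that supplementary characters become a single numeric character reference.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch, not the Commons Lang implementation.
final class CharsetAwareEscapeSketch {

    /**
     * Replaces every code point that the target charset cannot encode with a
     * hexadecimal numeric character reference. Markup characters are NOT
     * handled here; that is the separate, first concern.
     * Assumption: the input contains no unpaired surrogates; those would be
     * reported as unencodable and turned into a reference that XML forbids.
     */
    static String escapeUnencodable(String input, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(input.length() + 16);
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            int width = Character.charCount(codePoint);
            CharSequence unit = input.subSequence(i, i + width);
            if (encoder.canEncode(unit)) {
                out.append(unit);
            } else {
                out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
            i += width;
        }
        return out.toString();
    }
}
```

For example, escapeUnencodable("Jörg", StandardCharsets.US_ASCII) gives "J&#xf6;rg", while the same call with StandardCharsets.UTF_8 returns the input unchanged.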

Yet another separate question has arisen as to how to handle input characters 
which cannot appear in any way in a well-formed XML (1.0 / 1.1) document, even 
as character references (e.g. U+0000).  I'm not so certain that 
StringEscapeUtils needs to be concerned about that, and it would simplify 
things immensely if it considered that out of scope.  Among other effects, I 
believe that would moot the distinction between XML 1.0 and XML 1.1 (and future 
versions) for this class.  In addition, I strongly suspect that there are 
multiple production applications that (mis)use XML in a way that would be 
broken if character references to characters outside the XML character set were 
flagged as application errors; it would be considerate for StringEscapeUtils to 
be compatible with such (mis)use.
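For reference, the check that would be left out of scope is small in itself; what it drags in is the version dependence. The sketch below transcribes the Char productions of XML 1.0 and XML 1.1 as two predicates (the class and method names are hypothetical). U+0000 fails both, whereas a control character such as U+0001 is excluded by 1.0 but allowed by 1.1, where it may appear only as a character reference.

```java
// Hypothetical sketch: the Char productions of XML 1.0 and XML 1.1.
final class XmlCharRangeSketch {

    /** XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** XML 1.1: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
    static boolean isXml11Char(int codePoint) {
        return (codePoint >= 0x1 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

An escaper that skips this check never has to ask which XML version the caller is targeting.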


Best Regards,

John

--
John Bollinger
thinma...@yahoo.com




