Re: One of those delighting topics: Control characters and the MarkupBuilder

Simon Tost Fri, 11 Sep 2020 14:15:19 -0700

In case it was too much text -- here is the story again, written in Groovy:
https://github.com/apache/groovy/pull/1366


Best,
Simon



On 07.08.20 23:58, Simon Tost wrote:
> Hi all,
>
>
> Story time:
> So we built this script, reading from the REST API of a webapp,
> writing some .xml file and uploading it, zipped, to some SFTP endpoint.
> For writing the .xml we used a textbook-like [1] way [2] to build some
> nice, horrifying, XML-ish document,
> using what amounts to unvalidated user input in some of the text nodes.
>
> To no one's surprise (at this point), now, 2 years later, the receiving
> entity complains that they find illegal characters '0x8' in the
> uploads, which they cannot parse.
>
>
> Turns out XML [3] and HTML [4] both have their own opinion about what
> characters are allowed in their documents.
> But at least they agree that most control characters (0x0 - 0x8; 0xB;
> 0xE - 0x1F) are bad; 0xC is rejected only by XML [3] and 0xD only by
> HTML character references [4]. Some others are at least 'discouraged'
> (0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, ...).
>
>
> Now, the "MarkupBuilder" is first and foremost called "MarkupBuilder".
> So one could argue that it /does/ handle the markup part just fine
> and that's all that it /should/ do.
>
> On the other hand, the class proclaims itself in the javadoc [5] to be
> "for creating XML or HTML markup".
> And the documentation [1] also kind of markets it for that purpose.
> (And maybe it's a bad look to be able to write invalid .xml?)
>
>
> So here is the question to you:
> 1) Is the MarkupBuilder's behavior okay as-is?
> 2) (if not 1) What should the behavior be instead?
>
> 3) Is this historically a 'done discussion', and are we unwilling to
> open up /that/ can of worms again?
> (What was the previous consensus?)
>
>
>
> Going a bit further with this, personally, I could imagine:
> * by default sanitizing the output of MarkupBuilder to a compatible
> subset of characters for _both_ formats
> * having some config option to switch to 'xml', 'html' or 'off' mode
> for "character set validation"
> * dealing with invalid characters by replacing them with the \uFFFD (�)
>   character (as one comment on the Jeff Atwood answer post [6] suggested)
>
> Which might be the maximum degree of changing things.
> But I'm eager to hear some of your opinions.
>
> Any thoughts / arguments / things I've missed so far?
> Any chance of finding some kind of consensus on the matter?
>
>
> Best,
> Simon
>
>
> [1] https://groovy-lang.org/processing-xml.html#_markupbuilder
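> [2]
>     import groovy.xml.MarkupBuilder
>
>     private toXmlFile(body) {
>         def writer = new StringWriter()
>         def xml = new MarkupBuilder(writer)
>
>         // body is a closure that builds the document on the given builder
>         body(xml)
>
>         // prepend the XML declaration to the builder's output
>         '<?xml version="1.0" encoding="UTF-8"?>' + "\n" +
>             writer.toString() + "\n"
>     }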
>
> [3]
> https://www.w3.org/TR/xml/#NT-Char
> "Consequently, XML processors MUST accept any character in the range
> specified for Char.
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
> blocks, FFFE, and FFFF. */"
> (Note: this is a "positive definition", and could be amended at some
> point to include /more/ characters.)
>
> [4] https://html.spec.whatwg.org/#character-references
> "The numeric character reference forms described above are allowed to
> reference any code point excluding U+000D CR, noncharacters
> <https://infra.spec.whatwg.org/#noncharacter>, and controls
> <https://infra.spec.whatwg.org/#control> other than ASCII whitespace
> <https://infra.spec.whatwg.org/#ascii-whitespace>."
>
> [5]
> https://docs.groovy-lang.org/latest/html/api/groovy/xml/MarkupBuilder.html
> [6]
> https://stackoverflow.com/questions/397250/unicode-regex-invalid-xml-characters/961504#961504
>
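
A minimal sketch of the replacement idea from the quoted message, assuming
the XML 1.0 Char production from [3] is used as the allow-list and that
everything outside it becomes U+FFFD. It is plain Java (callable from
Groovy); the class and method names are made up for illustration, and this
is not the code from the pull request linked above.

```java
// Hypothetical helper, not part of Groovy or MarkupBuilder.
final class XmlCharSanitizer {

    // Replacement for anything outside the allowed set (U+FFFD).
    static final int REPLACEMENT = 0xFFFD;

    // True if the code point matches the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Walks the text by code point, so valid surrogate pairs are kept
    // together and unpaired surrogates are replaced.
    static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp ->
                out.appendCodePoint(isXmlChar(cp) ? cp : REPLACEMENT));
        return out.toString();
    }
}
```

Running each user-supplied value through XmlCharSanitizer.sanitize(...)
before handing it to the builder would keep 0x8 and friends out of the text
nodes. An 'html' mode would need a different predicate, and 'off' would
simply skip the call.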


-- 
Simon Tost * simon.t...@tngtech.com * +49-176-17654629
TNG Technology Consulting GmbH, Beta-Str. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Gerhard Müller, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
