In case it was too much text -- here is the story again, written in Groovy: https://github.com/apache/groovy/pull/1366
Best,
Simon

On 07.08.20 23:58, Simon Tost wrote:
> Hi all,
>
>
> Story time:
> So we built this script, which reads from the REST API of a webapp,
> writes an .xml file and uploads it, zipped, to an sftp endpoint.
> For writing the .xml we used a textbook-like [1] way [2] to build some
> nice, horrifying, XML-ish document,
> using what amounts to unvalidated user input in some of the text nodes.
>
> To no one's surprise (at this point), now, 2 years later, the receiving
> entity complains that they find illegal characters '0x8' in their
> uploads, which they cannot parse.
>
>
> Turns out XML [3] and HTML [4] each have their own opinion about which
> characters are allowed in their documents.
> But at least they agree that most control characters (0x0 - 0x8; 0xB;
> 0xC; 0xE - 0x1F) are bad, and some others are at least 'discouraged'
> (0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF, ...).
>
>
> Now, the "MarkupBuilder" is first and foremost called "MarkupBuilder".
> So one could argue that it /does/ handle the markup part just fine
> and that's all that it /should/ do.
>
> On the other hand, the class proclaims itself in the javadoc [5] to be
> "for creating XML or HTML markup".
> And the documentation [1] also kind of markets it for that purpose.
> (And, maybe, it's a bad look to be able to write invalid .xml?)
>
>
> So here are the questions to you:
> 1) Is the MarkupBuilder's behavior okay as-is?
> 2) (If not 1) What should the behavior be?
>
> 3) Is this historically a 'done discussion', and are we unwilling to
> open up /that/ can of worms again?
> (What was the previous consensus?)
>
>
> Going a bit further with this, personally, I could imagine:
> * sanitizing the output of MarkupBuilder by default to a subset of
>   characters that is compatible with _both_ formats
> * having some config option to switch to 'xml', 'html' or 'off' mode
>   for "character set validation"
> * dealing with invalid characters by replacing them with the \uFFFD (�)
>   character
>   (as one comment on the Jeff Atwood answer post [6] suggested)
>
> That might be the maximum degree of changing things.
> But I'm eager to hear some of your opinions.
>
> Any thoughts / arguments / things I've missed so far?
> Any chance of finding some kind of consensus on the matter?
>
>
> Best,
> Simon
>
>
> [1] https://groovy-lang.org/processing-xml.html#_markupbuilder
> [2]
> import groovy.xml.MarkupBuilder
>
> // Renders the markup built by the `body` closure and prepends an XML declaration.
> private toXmlFile(body) {
>     def writer = new StringWriter()
>     def xml = new MarkupBuilder(writer)
>
>     body(xml)
>
>     '<?xml version="1.0" encoding="UTF-8"?>' + "\n" +
>         writer.toString() + "\n"
> }
>
> [3] https://www.w3.org/TR/xml/#NT-Char
> "Consequently, XML processors MUST accept any character in the range
> specified for Char.
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
> [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character,
> excluding the surrogate blocks, FFFE, and FFFF. */"
> (Note: this is a "positive definition", and could be amended at some
> point to include /more/ characters.)
>
> [4] https://html.spec.whatwg.org/#character-references
> "The numeric character reference forms described above are allowed to
> reference any code point excluding U+000D CR, noncharacters
> <https://infra.spec.whatwg.org/#noncharacter>, and controls
> <https://infra.spec.whatwg.org/#control> other than ASCII whitespace
> <https://infra.spec.whatwg.org/#ascii-whitespace>."
>
> [5] https://docs.groovy-lang.org/latest/html/api/groovy/xml/MarkupBuilder.html
> [6] https://stackoverflow.com/questions/397250/unicode-regex-invalid-xml-characters/961504#961504
>

-- 
Simon Tost * simon.t...@tngtech.com * +49-176-17654629
TNG Technology Consulting GmbH, Beta-Str. 13a, 85774 Unterföhring
Managing directors: Henrik Klagges, Gerhard Müller, Dr. Robert Dahlke
Registered office: Unterföhring * Amtsgericht München * HRB 135082
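
To make the "replace with \uFFFD" idea from the quoted proposal a bit more concrete, here is a minimal sketch in plain Groovy. It only covers the XML side, using the Char production quoted in [3]; the 'discouraged' code points and the HTML rules from [4] are not handled. The names isXmlChar and sanitizeForXml are made up for illustration and are not part of MarkupBuilder or any Groovy API; this is not necessarily what the pull request does.

```groovy
// Hypothetical helper (not a Groovy API): true if the code point matches
// the XML 1.0 Char production quoted in [3].
boolean isXmlChar(int cp) {
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
            (cp >= 0x20 && cp <= 0xD7FF) ||
            (cp >= 0xE000 && cp <= 0xFFFD) ||
            (cp >= 0x10000 && cp <= 0x10FFFF)
}

// Hypothetical helper (not a Groovy API): replaces every code point that is
// not an XML 1.0 Char with U+FFFD. Walks the string by code point, so valid
// surrogate pairs are kept and lone surrogates are replaced.
String sanitizeForXml(String text) {
    def out = new StringBuilder(text.length())
    int i = 0
    while (i < text.length()) {
        int cp = text.codePointAt(i)
        out.appendCodePoint(isXmlChar(cp) ? cp : 0xFFFD)
        i += Character.charCount(cp)
    }
    return out.toString()
}
```

With something like this, user input could be passed through sanitizeForXml(...) before it is handed to the builder as a text node, for example xml.note(sanitizeForXml(userInput)), where note and userInput are placeholders. Whether such a step belongs inside MarkupBuilder, and whether it should be on by default, is exactly the open question above.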