On 2022-08-01 16:12 Hussein Shafie wrote:
On 7/31/22 16:15, Leif H Silli wrote:
Title:

    RFE: Let automatic oneway conversion of named character entities
         *also* apply to documents without DTD

… snipped …

We are well-aware of this limitation. Always have been.

For example, see how Norman Walsh (a famous XML expert, very much
appreciated by us) rants about XMLmind XML Editor:

February 23, 2006 (!!!) https://norman.walsh.name/2006/02/23/whitespace
---
Hey, you! Yeah, you! XML editing tool vendor! Lemme ask you something,
why is it that you think you can fuck with the white space in my mixed
content? White space in mixed content is significant. If I put it
there, leave it alone! If I didn't put it there, keep your “helpful”
fingers out of it!
...
I'm talking to you, XMLmind.
---

It perhaps does not matter very much - your attitude to my proposal might be the same either way … However, I would like to point out that Walsh’s rant does not really hit the same nail that I am trying to hit …

Because: Please note that I am not begging for any new behavior with regard to whitespace. Instead, I suggest that you follow the same pattern for named entities as you already follow for whitespace: Destroy them, but destroy them in an XML-compatible manner.

For any undeclared entity found in an XHTML document (or in an SVG or MathML document, for that matter; we do not need to be XHTML-specific, since the entities that HTML5 declares are collected from HTML5, SVG and MathML in order to support interoperability between them), let XXE check whether the entity's name occurs on the list declared by HTML5. If it does, assume that the entity is meant to refer to the character (or characters) that HTML5 assigns to that name, and replace it with XML-compatible character references and/or directly typed characters.
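
The check-and-replace step described above could be sketched roughly as follows. This is only an illustration in Python, not XXE's code: the function name `replace_html5_entities` and the `declared` parameter are made up for the example, while the name list comes from Python's standard `html.entities.html5` table. The regular expression is a simplification; it does not skip comments or CDATA sections.

```python
import re
from html.entities import html5  # standard table, e.g. "nbsp;" -> "\xa0"

# The five entities predefined by XML itself; these are always left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Simplified pattern for a named entity reference such as &nbsp;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html5_entities(source: str, declared: frozenset = frozenset()) -> str:
    """Replace undeclared HTML5 named entities with hexadecimal references.

    `declared` (hypothetical parameter) holds the entity names that the
    document declares itself; those references are kept as they are.
    """
    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name in declared:
            return match.group(0)
        chars = html5.get(name + ";")
        if chars is None:
            # Not on the HTML5 list: leave it for the XML parser to report.
            return match.group(0)
        # A few HTML5 names expand to two code points, hence the join.
        return "".join(f"&#x{ord(c):X};" for c in chars)

    return ENTITY_REF.sub(substitute, source)
```

With this sketch, `replace_html5_entities("a&nbsp;b")` returns `a&#xA0;b`, while `&amp;` and any name passed in `declared` are left untouched.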

Possibly also issue a warning before the user is permitted to save or edit the freshly opened document any further. (Such a warning would be more than you currently do for re-arranged whitespace. However, since entities are supposed to have a meaning defined outside the document, such a warning would make sense.)

So, to emphasize once again that I suggest copying XXE’s current behavior with regard to whitespace and applying it to HTML5’s named character entities as well, here is another way to say it:

Whitespace (according to the XML spec) «consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs». These four characters are thus treated as synonyms. So when XXE “destroys” the whitespace that some other authoring tool or human author added to a document, it simply means that it chooses the “synonym” character that best fits XXE’s rearrangement plan (which is also affected by the user's configuration of XXE’s whitespace treatment). Typically, it replaces tabs with spaces and inserts hard returns/line breaks wherever necessary.
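
To make the “synonym” idea concrete, here is a small Python sketch of such a rearrangement. It is an assumption about the general technique, not a description of what XXE actually does; the function name `rearrange_whitespace` and the `width` parameter are invented for the example. Only the four XML whitespace characters are touched, so a no-break space survives as part of a word.

```python
import re

# The four whitespace characters of the XML spec: space, tab, CR, LF.
XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def rearrange_whitespace(text: str, width: int = 72) -> str:
    """Swap every whitespace run for a single space or a line break.

    The words and their order are unchanged; only the choice of
    whitespace "synonym" between them differs from the input.
    """
    words = [w for w in XML_WHITESPACE.split(text) if w]
    lines, current = [], ""
    for word in words:
        if current and len(current) + 1 + len(word) > width:
            # The word does not fit: use a line break as the separator.
            lines.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        lines.append(current)
    return "\n".join(lines)
```

For example, `rearrange_whitespace("a\t\tb\r\n  c")` gives `a b c`.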

Likewise, when it comes to the named character entities defined by HTML5 (which apply to SVG and MathML just as much as they apply to HTML), each named character reference is just a synonym for the directly typed character as well as for its decimal or hexadecimal numeric character reference. From a (human) author's point of view (and perhaps even from a computer program’s point of view), the use of &nbsp; instead of the directly typed character (or numeric character reference) can be highly significant. For instance, for a human, it is (usually) easier to spot the &nbsp; entity than to spot the no-break space character directly. Hence, there will be some users who would drop into rant mode when they discover an XML editor that converts the HTML5-defined named character entities to their Unicode-defined counterparts. It is, in fact, some kind of destruction of the source code.
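
The synonym relationship can be checked with Python's standard `html.unescape`, which resolves named and numeric references by HTML5 rules. This is just an illustrative sketch; the helper names `resolve` and `make_visible` are invented here.

```python
import html

# Four spellings of the same character, U+00A0 NO-BREAK SPACE.
FORMS = ["&nbsp;", "&#160;", "&#xA0;", "\u00a0"]


def resolve(form: str) -> str:
    """Resolve a character reference (or a literal) to the character itself."""
    return html.unescape(form)


def make_visible(text: str) -> str:
    """Turn literal no-break spaces back into a reference a human can spot."""
    return text.replace("\u00a0", "&#xA0;")


# All four forms resolve to the same single character.
assert {resolve(form) for form in FORMS} == {"\u00a0"}
```

The information content is identical in all four forms; what differs is only how easily a human reader can see it in the source, which is what `make_visible` tries to restore in an XML-compatible way.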

And to add a third similarity: For some authors, whitespace is important. For many others, it isn’t. That is why XXE can get away with its current behavior. The same would be the case for the treatment of HTML5 named entities that I suggest.

Btw, I still believe that XXE should respect entity declarations when they exist, so that one could override the HTML5 named entity declarations. Yeah, the behavior I suggest should probably be applied only to documents that lack named character entity declarations.


    Finally, there is already an option in the Preferences to «Simulate a DTD» when there is no DTD, and it would certainly be appropriate, make sense, and be in line with the HTML5 spec to simulate (at least with a warning) that named character entities have been declared.

"Simulate a DTD" does not mean guessing. It simply uses the elements
and attributes already found in a schema-less document instance to
make it quicker and easier to add more elements and more attributes
having the same names.
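
As a rough illustration of that idea (an assumption about the general approach, not XXE's actual implementation), the following Python sketch collects the element names and attribute names already present in a schema-less document. The function name `collect_vocabulary` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_vocabulary(xml_source: str) -> dict:
    """Map each element name found in the document to its attribute names.

    Nothing is guessed: only names that actually occur in the instance
    are recorded, so they can be offered again when editing.
    """
    vocabulary = defaultdict(set)
    for element in ET.fromstring(xml_source).iter():
        vocabulary[element.tag].update(element.attrib)
    return dict(vocabulary)
```

For `<doc id="1"><p class="a"/><p lang="en"/></doc>`, this yields `doc` with the attribute `id`, and `p` with the attributes `class` and `lang`.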

Thanks for explaining. Makes sense, now ...


   From time to time, there are complaints about XXE’s “destructive” treatment of source code. My claim is that, with regard to named character references, XXE would fit better into its common workflows if it went all out in its “destructive” behavior.

Once again, you are right.

However we currently don't plan to change XXE behavior in this regard
in the near future.

If so, I guess I need to continue to add DTDs that declare those entities then ...

Note that, quite honestly, "XXE’s “destructive” treatment of source
code" is flagged in red as being a possible "deal breaker". See
http://www.xmlmind.com/xmleditor/features.html

Therefore if this limitation is really a problem for an XML author,
she/he must not even attempt to use XXE.

Once again, I will reiterate that I am not suggesting that you stop the destruction. Instead, I am asking for more destruction … ;-)

Leif Halvard Silli

--
XMLmind XML Editor Support List
xmleditor-support@xmlmind.com
http://www.xmlmind.com/mailman/listinfo/xmleditor-support
