Chris Angelico wrote:

> On Sun, Mar 13, 2016 at 6:24 AM, Thomas 'PointedEars' Lahn
> <pointede...@web.de> wrote:
>> Marko Rauhamaa wrote:
>>> […] HTML markup is all ASCII.
>>
>> Wrong.  I am creating HTML documents whose source code contains Unicode
>> characters every day.
>>
>> Also, the two of you fail to differentiate between US-ASCII, a 7-bit
>> character encoding, and 8-bit or longer encodings which can *also* encode
>> characters that can be *encoded with* US-ASCII.
>
> Where are the non-ASCII characters in your HTML documents? Are they in
> the *markup* of HTML, or in the *text*? This is the difference.

There is a misconception on your part instead.  The text content of an
HTML/Web document (the part between the [HTML] tags) is *part* of the
(HTML) markup as it is (at least) *a part* of the content of (HTML)
elements. [1a] [1b]

Besides, even if one were to unwisely adopt your private definition of
“markup”, Unicode characters that cannot be encoded with US-ASCII are of
course allowed verbatim in attribute values, and to a lesser degree (not
in HTML 4.01 and below) in element type names and attribute names, as
well – therefore, according to even your *wrong* private definition of
“markup”, “*in* the markup of HTML”. [2][3]

Bottom line: If one declares the character encoding that one uses in an
SGML-based (HTML up to and including version 4.01, XML and all XML-based
document types) or SGML-related (HTML5) markup document (there are
several possibilities for that)¹, there is no need to use character
entity references instead of plain Unicode characters.  And if you avoid
spaghetti code, the probability of the need for numeric character
references in HTML is also quite low.  (The same applies to lightweight
markup languages like Markdown, but let us not get there now.)

[In fact, the possibility of using, verbatim, characters other than
those that can be encoded with US-ASCII applies to all Internet
messages, including e-mail and Usenet postings, and to a lesser degree
(because there are fewer declaration mechanisms available) to all forms
of electronically stored/readable text.  As of RFC 5536,
standards-compliant Network News client software is even required to
support MIME. [4]]

[This was a professional Web author/developer with more than a decade of
continuing work experience clarifying your misconception.  I recommend
that you subscribe to the newsgroups in the
comp.infosystems.www.authoring.* hierarchy, where this discussion would
have been on-topic, and to <news:comp.lang.javascript>, to clarify some
of the other misconceptions that you may have acquired about
Web(-related) authoring/development.]

________
¹ This is only to be reasonably safe from surprises; several of those
  markup languages require the assumption of a default character
  encoding and/or the implementation of character encoding detection
  for their parsers, but not all parsers are conforming, and it stands
  to reason that parser efficiency can be increased if the encoding
  does not have to be detected/inferred first.

[1a] <https://en.wikipedia.org/wiki/Markup_language#Etymology_and_origin>
[1b] <https://www.w3.org/TR/1999/REC-html401-19991224/intro/sgmltut.html#h-3.2.1>
     <http://www.w3.org/TR/2014/REC-html5-20141028/dom.html#elements>
[2]  <http://www.w3.org/TR/2014/REC-html5-20141028/infrastructure.html#encoding-terminology>
[3]  <https://www.w3.org/TR/1999/REC-html401-19991224/charset.html#doc-char-set>
     <http://www.w3.org/TR/2014/REC-html5-20141028/syntax.html#parsing>
[4]  <http://tools.ietf.org/html/rfc5536#section-2.3>

> And I'm not conflating those two. When I say ASCII, I am referring to
> the 128 characters that have Unicode codepoints U+0000 through U+007F.

That is only your private definition of ASCII.  The commonly accepted
definition is along these lines instead:

<https://en.wikipedia.org/wiki/ASCII> pp.

(See also the Specification references above.)

HTH

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.
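
To illustrate the point made above about declaring the character
encoding, here is a minimal Python sketch.  It is an assumption-laden
illustration, not a conformance test: the sample document, the class
name Collector and the helper is_us_ascii are all made up for this
example, and Python's html.parser is only one (lenient) parser among
many.  The sketch encodes a document as UTF-8, decodes it with the
declared encoding, and checks that characters outside U+0000..U+007F
arrive intact both in an attribute value and in text content, without
any character references.

```python
from html.parser import HTMLParser

# Hypothetical sample document: non-ASCII characters appear verbatim in an
# attribute value and in text content; the encoding is declared via <meta>.
SOURCE = (
    '<!DOCTYPE html>\n'
    '<html lang="de">\n'
    '<head><meta charset="UTF-8"><title>Test</title></head>\n'
    '<body><p title="Überschrift">Straße</p></body>\n'
    '</html>\n'
)


class Collector(HTMLParser):
    """Collect attribute values and non-blank text nodes."""

    def __init__(self):
        super().__init__()
        self.attr_values = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.attr_values.extend(v for _, v in attrs if v is not None)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data)


def is_us_ascii(s):
    """True if every character lies in U+0000..U+007F (fits in 7 bits)."""
    return all(ord(ch) < 0x80 for ch in s)


raw = SOURCE.encode("utf-8")          # bytes as stored or transmitted

# A strict US-ASCII decoder cannot read these bytes ...
try:
    raw.decode("ascii")
    decodable_as_ascii = True
except UnicodeDecodeError:
    decodable_as_ascii = False

# ... but decoding with the declared encoding works without references.
collector = Collector()
collector.feed(raw.decode("utf-8"))
collector.close()

non_ascii_attrs = [v for v in collector.attr_values if not is_us_ascii(v)]
non_ascii_text = [t for t in collector.text if not is_us_ascii(t)]

print("decodable as US-ASCII:", decodable_as_ascii)
print("non-ASCII attribute values:", non_ascii_attrs)
print("non-ASCII text nodes:", non_ascii_text)
```

The same bytes that a 7-bit decoder rejects are read without trouble
once the declared encoding is used, which is why the declaration, not
the choice of references over plain characters, is what matters here.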