Martin v. Lowis wrote:
> Lorenzo Gatti wrote:
>> Not providing an explicit listing of allowed characters is inexcusable
>> sloppiness.
> That is a deliberate part of the specification. It is intentional that
> it does *not* specify a precise list, but instead defers that list
> to the version of the Unicode standard used (in the unicodedata
> module).

OK, maybe you considered listing characters and earnestly decided to
follow an authority instead; but this reliance on the Unicode standard
is not a merit: it defers a foundation of Python syntax to an external
entity (UAX 31 and the Unicode database).

The obvious purpose of Unicode Annex 31 is to define a framework for
parsing the identifiers of arbitrary programming languages; it is only,
in its own words, "specifications for recommended defaults for the use
of Unicode in the definitions of identifiers and in pattern-based
syntax". It suggests an orderly way to add tens of thousands of exotic
characters to programming language grammars, but it doesn't prove that
it would be wise to do so.

You seem to like Unicode Annex 31, but keep in mind that:

- it draws on very limited resources (only the Unicode standard, i.e.
  lists and properties of characters, and not sensible programming
  language design, software design, etc.)
- it is culturally biased in favour of supporting as much of the
  Unicode character set as possible, disregarding the practical
  consequences and assuming without discussion that programming
  language designers want to do so
- it is also culturally biased towards the typical Unicode patterns of
  providing well explained general algorithms, ensuring forward
  compatibility, and relying on existing Unicode standards (in this
  case, character types) rather than introducing new data (but the
  character list of Table 3 is unavoidable); the net result is that it
  cares even less about actual usage.

>> The XML standard is an example of how listings of large parts of the
>> Unicode character set can be provided clearly, exactly and (almost)
>> concisely.
> And, indeed, this is now recognized as one of the bigger mistakes
> of the XML recommendation: they provide an explicit list, and fail
> to consider characters that are unassigned. In XML 1.1, they try
> to address this issue, by now allowing unassigned characters in
> XML names even though it's not certain yet what those characters
> mean (until they are assigned).

XML 1.1 is, for practical purposes, not used except by mistake. I
challenge you to show me XML languages or documents of some importance
that need XML 1.1 because they use non-ASCII names. XML 1.1 is
supported by many tools and standards because of buzzword compliance,
enthusiastic obedience to the W3C and low cost of implementation, but
this doesn't mean that its features are an improvement over XML 1.0.

>>> ``ID_Continue`` is defined as all characters in ``ID_Start``, plus
>>> nonspacing marks (Mn), spacing combining marks (Mc), decimal number
>>> (Nd), and connector punctuations (Pc).
>>
>> Am I the first to notice how unsuitable these characters are?

> Probably. Nobody in the Unicode consortium noticed, but what
> do they know about suitability of Unicode characters...

Don't be silly. These characters are suitable for writing text, not for
use in identifiers; the fact that UAX 31 allows them merely proves how
disconnected that document is from actual programming language needs.
In typical word processing, which characters are used is the editor's
problem and the only thing that matters is the correctness of the
printed result; program code is much more demanding, as it needs to do
more (exact comparisons, easy reading...) with less (straightforward
keyboard input and monospaced fonts instead of complex input systems
and WYSIWYG graphical text). The only way to work with program text
successfully is to limit its complexity.
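
To make the quoted ID_Continue definition and the "exact comparisons"
point concrete, here is a minimal sketch, assuming Python 3 and its
standard unicodedata module. The helper name and the sample characters
are only my illustration; they are not taken from the PEP or from
UAX 31.

```python
import unicodedata

# General categories that the quoted text adds to ID_Start
# in order to form ID_Continue.
CONTINUE_ONLY_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}


def continue_only_category(ch):
    """Return ch's general category if it is Mn, Mc, Nd or Pc, else None.

    Hypothetical helper for illustration; it looks only at the general
    category and ignores the other rules of UAX 31.
    """
    category = unicodedata.category(ch)
    return category if category in CONTINUE_ONLY_CATEGORIES else None


# One sample character per category (illustrative choices).
samples = ["\u0301",  # COMBINING ACUTE ACCENT
           "\u093e",  # DEVANAGARI VOWEL SIGN AA
           "\u0661",  # ARABIC-INDIC DIGIT ONE
           "\u203f"]  # UNDERTIE

for ch in samples:
    print("U+%04X" % ord(ch), unicodedata.name(ch),
          continue_only_category(ch))

# Two spellings of the same visible name: "e" followed by a combining
# accent, and the precomposed letter. They compare unequal as raw
# strings and become equal only after normalization.
decomposed = "e\u0301"
precomposed = "\u00e9"
print(decomposed == precomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The last two lines show the kind of ambiguity I mean: the two strings
look the same on screen, yet a plain comparison treats them as
different unless the tool normalizes them first.
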
Hard-to-input characters, hard-to-see characters, ambiguities and
uncertainty in the sequence of characters, sets of hard-to-distinguish
glyphs and similar problems are unacceptable.

It seems I'm not the first to notice that a lot of Unicode characters
are unsuitable for identifiers. Appendix I of the XML 1.1 standard
recommends avoiding variation selectors, interlinear annotations (I
missed them...), various decomposable characters, and "names which are
nonsensical, unpronounceable, hard to read, or easily confusable with
other names". The whole of Appendix I is a clear admission of defeat,
probably the result of committee compromises. Do you think you could
do better?

Regards,
Lorenzo Gatti

--
http://mail.python.org/mailman/listinfo/python-list