New submission from Ezio Melotti <ezio.melo...@gmail.com>:

The decimal codec only handles characters in the Nd (Number, decimal) Unicode category and whitespace [a]. It is used by int(), float(), complex() and indirectly by Decimal(), Fraction() and possibly others.

This works well only for plain digits (e.g. int(u'123')), but it doesn't work for all the other characters used to represent numbers, like:

1. plus or minus sign, e.g. int(u'+123') or int(u'-123')
2. decimal point, e.g. float(u'1.23')
   2.1 some languages/alphabets use other chars (e.g. a comma or other symbols) instead of the decimal point.
3. exponential notation, e.g. float(u'1e5')
4. the 'j' in complex numbers, e.g. complex(u'3j')
5. the 'x' and 'p' in hexadecimal floats, e.g. float.fromhex(u'0x1.7p3')
   5.1 hex floats also use hexadecimal digits, see 6.3
6. digits > 9 for numbers with a base > 10, e.g. int(u'7F', 16)
   6.1 not all the alphabets have the equivalent of the letters a-z
   6.2 afaik there are no standards that specify how to deal with digits > 9
   6.3 in the Unicode FAQ [b] there's a link to a table [c] that says "Code points not listed in this file are not hexadecimal digits." This is not a standard though, and even if in the UCD [d] there's a file [e] where the characters with the Hex_Digit property are defined, it doesn't say that *only* these characters are valid hex digits. It also doesn't say anything about other bases. Python currently accepts int(u'１０', 16) (fullwidth digits), int(u'७', 16) (U+096D - DEVANAGARI DIGIT SEVEN) and even int(u'７F', 16) (fullwidth 7; with a normal F it works, with a fullwidth F it fails).
   6.4 UTS #18 [f] includes in the property 'xdigit' [g] (hexadecimal digit) all the chars defined in [c] and also all the chars with the Nd category. This is not a standard either, and it gives no indication about which hex digits are valid or how int() should behave.
   6.5 if possible re and int() should agree. Any string that matches /^[[:xdigit:]]+$/ should work fine with int(s, 16) and vice versa. See also #6561 [h] and #2636 [i].
7. possibly others
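
As a rough illustration of the rule described at the top of this report (Nd characters and whitespace are converted, everything else is rejected), here is a minimal Python 3 sketch. The function name to_ascii_decimal is hypothetical, and letting other ASCII characters pass through unchanged is a simplifying assumption; this is not the C implementation referenced in [a].

```python
import unicodedata


def to_ascii_decimal(text):
    """Hypothetical sketch of the 'decimal codec' rule, not CPython's code.

    Whitespace becomes a plain space, Nd characters become the matching
    ASCII digit, other ASCII characters pass through (an assumption made
    here for simplicity), and anything else is rejected.
    """
    out = []
    for ch in text:
        if ch.isspace():
            out.append(" ")
        elif unicodedata.category(ch) == "Nd":
            # Every Nd character has a decimal value in the range 0-9.
            out.append(str(unicodedata.decimal(ch)))
        elif ord(ch) < 128:
            out.append(ch)
        else:
            raise ValueError("cannot convert %r to an ASCII digit" % ch)
    return "".join(out)


# DEVANAGARI DIGIT SEVEN followed by an ASCII 'F' is accepted.
print(to_ascii_decimal("\u096dF"))

# FULLWIDTH DIGIT SEVEN followed by FULLWIDTH LATIN CAPITAL LETTER F
# is rejected, because the fullwidth F is not in the Nd category.
try:
    to_ascii_decimal("\uff17\uff26")
except ValueError as exc:
    print("rejected:", exc)
```

Under these assumptions the digits from any Nd script are accepted, while the fullwidth F is refused, which matches the int() behavior described in 6.3.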

For all the chars listed in points 1-5 there's no way, AFAIK, to know their equivalents in other alphabets (if they exist at all), and since (apparently) there's no standard that specifies how to handle them, they should be kept out. This will also avoid a number of problems, e.g. 2.1.

The fullwidth forms are an exception though: they seem to be the only set of characters with a direct equivalent for all these chars, and they are also the only non-ASCII chars included in the list of chars with the Unicode Hex_Digit property. Including all the necessary chars from this range in the decimal codec seems to me the best thing to do. The fullwidth equivalents of the chars listed in points 1-5 should all be implemented and they should work everywhere. The regex used by Decimal/Fraction should be updated as well, since the decimal codec is not accessible from Python (maybe it should be accessible, but this is another issue).

Point 6 is a slightly different issue, even if it can be partially solved if the fullwidth forms are included. One of the possible options is to limit the valid chars used by int() with bases > 10 to the characters listed in [c], but this won't be backward-compatible with existing code or forward-compatible with [[:xdigit:]]. OTOH if we keep the current behavior it will be possible to express the digits from 0 to 9 using several alphabets, but all the digits > 9 will be limited to [a-fA-F] (and possibly their fullwidth equivalents [ａ-ｆＡ-Ｆ]). For example, '7F' in the Devanagari alphabet will result in a mix of Devanagari digits and ASCII letters, i.e. int(u'७F', 16) (this already works in Python).
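
To make the first option for point 6 concrete, here is a minimal Python 3 sketch of a base-16 parser that accepts only the characters listed in [c] (ASCII 0-9, a-f, A-F and their fullwidth equivalents). The name int_hex_strict is hypothetical, and the sketch deliberately ignores signs, whitespace and the '0x' prefix; it only illustrates the trade-off, it is not a proposed patch.

```python
# ASCII hexadecimal digits; their fullwidth equivalents sit at a fixed
# offset of 0xFEE0 in the Halfwidth and Fullwidth Forms block.
ASCII_HEX = "0123456789abcdefABCDEF"
FULLWIDTH_OFFSET = 0xFEE0
FULLWIDTH_TO_ASCII = {ord(c) + FULLWIDTH_OFFSET: c for c in ASCII_HEX}


def int_hex_strict(text):
    """Hypothetical strict parser: only the characters listed in [c]."""
    ascii_text = text.translate(FULLWIDTH_TO_ASCII)
    if not ascii_text or any(c not in ASCII_HEX for c in ascii_text):
        raise ValueError("not a string of Hex_Digit characters: %r" % text)
    return int(ascii_text, 16)


# Fullwidth 7 and fullwidth F are both accepted by the strict variant.
print(int_hex_strict("\uff17\uff26"))

# DEVANAGARI DIGIT SEVEN + ASCII 'F' is accepted by the built-in int(),
# but the strict variant rejects it: this is the backward-compatibility
# cost mentioned above.
print(int("\u096dF", 16))
try:
    int_hex_strict("\u096dF")
except ValueError as exc:
    print("rejected:", exc)
```

The two variants disagree in both directions: the strict one gains the fullwidth letters but loses every non-ASCII, non-fullwidth digit that int() accepts today.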

[a]: http://svn.python.org/view/python/trunk/Objects/unicodeobject.c?view=markup under 'Decimal Encoder'
[b]: http://unicode.org/faq/casemap_charprop.html#13
[c]: http://unicode.org/faq/hex-digit-values.txt - [0-9a-fA-F０-９ａ-ｆＡ-Ｆ]
[d]: http://unicode.org/Public/UNIDATA/UCD.html#UCD_Files - PropList.txt section
[e]: http://unicode.org/Public/UNIDATA/PropList.txt
[f]: http://unicode.org/reports/tr18/ - UTS #18: Unicode Regular Expressions
[g]: http://unicode.org/reports/tr18/#Compatibility_Properties - xdigit row
[h]: http://bugs.python.org/issue6561#msg90878 point (1) about int() and re
[i]: http://bugs.python.org/issue2636#msg65513 point 8) will introduce [[:xdigit:]]

(Thanks to Mark Dickinson and Adam Olsen for pointing out some of these issues.)

----------
components: Interpreter Core, Unicode
messages: 91225
nosy: ezio.melotti
priority: normal
severity: normal
status: open
title: Include more fullwidth chars in the decimal codec
type: feature request
versions: Python 2.7, Python 3.2

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6632>
_______________________________________