[ python-Bugs-1450212 ] int() and isdigit() accept non-digit unicode numbers

SourceForge.net Wed, 15 Mar 2006 05:05:57 -0800

Bugs item #1450212, was opened at 2006-03-15 09:05
Message generated for change (Comment added) made by peufeu
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1450212&group_id=5470


Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Interpreter Core
Group: Python 2.4
Status: Closed
Resolution: Invalid
Priority: 5
Submitted By: Peufeu (peufeu)
Assigned to: Nobody/Anonymous (nobody)
Summary: int() and isdigit() accept non-digit unicode numbers

Initial Comment:
I had a very surprising bug this morning, in a Python script that 
extracts numeric information from human-entered text.

The problem is the following: many Unicode characters, in 
Unicode strings, are considered to be digits. For instance, the 
character "²" (does it appear on your screen? It's u'\xb2').

The output of the following command is pretty interesting :

print ''.join([x for x in map( unichr, xrange( 65536 )) if x.isdigit()])

Then, int() will happily parse the string :

int( u"٥٦٧٨٩۰۱۲" )
56789012

(I really hope this bug system supports unicode).

However, I can't do a=٥٦٧٨٩۰۱۲ for instance.

Philosophically, Python is right: these characters are probably all 
digits, and it's pretty cool to be able to parse numbers written in 
ARABIC-INDIC DIGITs or something (as unicodedata.name says).
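
A minimal sketch of how to inspect what the Unicode database says about such characters. It assumes Python 3, where every str is Unicode (the report itself concerns Python 2.4), and the helper name describe() is made up for illustration. It uses only unicodedata.name(), unicodedata.category(), str.isdigit() and str.isdecimal().

```python
import unicodedata


def describe(ch):
    """Return (name, general category, isdigit, isdecimal) for one character.

    Hypothetical helper, not part of any library.
    """
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        ch.isdigit(),
        ch.isdecimal(),
    )


# ASCII five, ARABIC-INDIC DIGIT FIVE (U+0665), SUPERSCRIPT TWO (U+00B2)
for ch in ("5", "\u0665", "\u00b2"):
    print(describe(ch))
# ('DIGIT FIVE', 'Nd', True, True)
# ('ARABIC-INDIC DIGIT FIVE', 'Nd', True, True)
# ('SUPERSCRIPT TWO', 'No', True, False)
```

The superscript two is in category No rather than Nd, which is why isdigit() accepts it while isdecimal() does not.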

However, from a practical point of view, I guess most parsing done 
with python isn't on OCR'd cuneiform stone tablets, but rather 
modern computer documents...

Whenever a surface (in m²) was near a phone number in my human 
entered text, the "²" would be absorbed as a part of the phone 
number, because u"²".isdigit() is True. Then bullshit phone numbers 
would appear on the website.

Any number followed by a little footnote number will get the 
footnote number embedded...
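
The effect described above can be reproduced with a short sketch. It assumes Python 3, and the sample text, the function digit_runs() and the constant ASCII_DIGITS are invented for illustration; they are not taken from the script in question. The function collects maximal runs of characters accepted by a given digit test, once with str.isdigit and once with an ASCII-only membership test.

```python
from itertools import groupby

# Hypothetical constant: the only characters we want to treat as digits.
ASCII_DIGITS = frozenset("0123456789")


def digit_runs(text, is_digit):
    """Return the maximal runs of characters for which is_digit(ch) is true."""
    return [
        "".join(group)
        for matched, group in groupby(text, key=is_digit)
        if matched
    ]


# "\u00b2" is SUPERSCRIPT TWO, standing in for a footnote marker.
sample = "call 0612345678\u00b2 today"

print(digit_runs(sample, str.isdigit))               # the footnote mark is absorbed
print(digit_runs(sample, ASCII_DIGITS.__contains__)) # ['0612345678']
```

With str.isdigit the run includes the trailing superscript, which is exactly how the footnote number ends up embedded in the phone number.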

I had to replace all the .isdigit() with a 
re.compile( ur"^\d+$" ).match(). Interestingly, for re, even in 
unicode, \d is 0-9 and nothing else.

At least, it would be normal for int() to raise an exception when fed 
this type of data. Please.




----------------------------------------------------------------------

>Comment By: Peufeu (peufeu)
Date: 2006-03-15 13:05

Message:
Logged In: YES 
user_id=587274

It certainly is confusing, and it bit me ;)

That .isdigit() is unicode-conformant is understandable (but a hint should 
be added to the docs IMHO). I wish there were a .isasciidigit() function on 
the unicode string, because using a helper is ugly.

However int() accepting all these characters and happily parsing them 
worries me a bit more. Is it really supposed to do this ?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-03-15 12:32

Message:
Logged In: YES 
user_id=38388

I can see your point, but if we were to follow that scheme,
we'd have to introduce a whole new set of APIs for Unicode
character testing.

Note that the comparison to C standards is flawed in this
respect: 

Unicode APIs would have to be compared to the wide character
APIs, e.g. iswdigit(), which does behave (more or less) like
isdigit() does in Python for Unicode characters.

Furthermore, the isXYZ() and iswXYZ() APIs in C are locale
aware (and so are the Python functions for strings), whereas
the Python Unicode implementation deliberately is not.

So in summary, you can't really compare the C functions to
the Python functions.


----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2006-03-15 12:18

Message:
Logged In: YES 
user_id=55188

In the mean time, it can be simply regarded as unicode
conforming.
But a minor issue came to my mind:

I think the name, `isdigit', is quite similar to ISO C's
equivalent.  But they don't behave the same; ISO C and POSIX
SUSv3 specify that isdigit() is true only for 0 1 2 3 4 5 6 7 8
9.  So, isdigit() of C doesn't return true for any
unicode character > ord('9').  I just fear that the
inconsistency might cause some confusion.



----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-03-15 10:42

Message:
Logged In: YES 
user_id=38388

Python is following the Unicode standard in this respect.

If you want to make sure that only a subset of numbers is
parsed, I'd suggest that you write a little helper function
that implements the RE check and then lets int() do its work.
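
One possible shape for such a helper, as a sketch only: it assumes Python 3 (the thread concerns Python 2.4), and the names parse_ascii_int and _ASCII_INT are hypothetical. The character class [0-9] is spelled out instead of \d, because in Python 3 \d on str patterns also matches other Unicode decimal digits unless re.ASCII is set.

```python
import re

# Explicit ASCII class; fullmatch() requires the whole string to match.
_ASCII_INT = re.compile(r"[0-9]+")


def parse_ascii_int(text):
    """Parse text as an integer, accepting only the characters 0-9.

    Hypothetical helper: raises ValueError for anything else, including
    digits from other scripts that int() alone would accept.
    """
    if _ASCII_INT.fullmatch(text) is None:
        raise ValueError("not an ASCII decimal integer: %r" % (text,))
    return int(text)


print(parse_ascii_int("56789"))   # 56789
print(int("\u0665\u0666"))        # 56, int() accepts ARABIC-INDIC digits
try:
    parse_ascii_int("\u0665\u0666")
except ValueError as exc:
    print("rejected:", exc)
```

Using fullmatch() rather than match() with "$" also rejects a trailing newline, which "$" would otherwise let through.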

Rejecting as "invalid".


----------------------------------------------------------------------
