Bugs item #1450212, was opened at 2006-03-15 18:05
Message generated for change (Comment added) made by perky
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1450212&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Interpreter Core
Group: Python 2.4
Status: Closed
Resolution: Invalid
Priority: 5
Submitted By: Peufeu (peufeu)
Assigned to: Nobody/Anonymous (nobody)
Summary: int() and isdigit() accept non-digit unicode numbers

Initial Comment:
I hit a very surprising bug this morning in a Python script that
extracts numeric information from human-entered text.

The problem is the following: many Unicode characters, in
unicode strings, are considered to be digits. For instance, the
character "²" (does it appear on your screen? It's u'\xb2').

The output of the following command is pretty interesting :

print ''.join([x for x in map( unichr, xrange( 65536 )) if x.isdigit()])

Then, int() will happily parse the string :

int( u"٥٦٧٨٩۰۱۲" )
56789012

(I really hope this bug system supports unicode).

However, I can't do a=٥٦٧٨٩۰۱۲ for instance.

Philosophically, Python is right: these characters are probably all
digits, and it's pretty cool to be able to parse numbers written in
ARABIC-INDIC DIGITs (or something like that, as unicodedata.name says).

However, from a practical point of view, I guess most parsing done 
with python isn't on OCR'd cuneiform stone tablets, but rather 
modern computer documents...

Whenever a surface area (in m²) appeared near a phone number in my
human-entered text, the "²" would be absorbed as part of the phone
number, because u"²".isdigit() is True. Bogus phone numbers would
then appear on the website.
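
A minimal sketch of the distinction at work here, written for Python 3
(an assumption; the report itself concerns Python 2.4). The helper name
describe() is hypothetical. It reports, for one character, its Unicode
name and whether it passes isdigit(), isdecimal(), and a plain ASCII
membership test:

```python
import unicodedata

# ASCII 7, SUPERSCRIPT TWO, ARABIC-INDIC DIGIT FIVE
SAMPLES = ["7", "\u00b2", "\u0665"]


def describe(ch):
    """Return (name, isdigit, isdecimal, is_ascii_digit) for one character."""
    return (
        unicodedata.name(ch),
        ch.isdigit(),          # true for any character with a digit value
        ch.isdecimal(),        # true only for decimal digits (category Nd)
        ch in "0123456789",    # true only for the ten ASCII digits
    )


if __name__ == "__main__":
    for ch in SAMPLES:
        print(describe(ch))
```

Under these assumptions, "²" passes isdigit() but not isdecimal(), while
an Arabic-Indic digit passes both; only the ASCII membership test
rejects everything outside 0-9.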

Any number followed by a little footnote number will get the 
footnote number embedded...

I had to replace all the .isdigit() calls with
re.compile( ur"^\d+$" ).match(). Interestingly, for re, even on unicode
strings, \d matches 0-9 and nothing else (at least without the
re.UNICODE flag).
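
A sketch of the same workaround for Python 3 (an assumption; the
original is Python 2 code). In Python 3, \d in a str pattern matches any
Unicode decimal digit unless re.ASCII is given, so this version spells
the class out as [0-9]. The names ASCII_DIGITS and is_ascii_number are
hypothetical:

```python
import re

# Explicit character class: matches the ten ASCII digits only,
# regardless of the re.ASCII / re.UNICODE flags.
ASCII_DIGITS = re.compile(r"[0-9]+")


def is_ascii_number(text):
    """Return True if text is a non-empty run of ASCII digits."""
    return ASCII_DIGITS.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["0612345678", "12\u00b2", "\u0665\u0666", ""]:
        print(repr(candidate), is_ascii_number(candidate))
```

Using fullmatch() avoids the need for the ^ and $ anchors, and also
avoids $ matching just before a trailing newline.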

At least, it would be normal for int() to raise an exception when fed 
this type of data. Please.




----------------------------------------------------------------------

>Comment By: Hye-Shik Chang (perky)
Date: 2006-03-15 21:18

Message:
Logged In: YES 
user_id=55188

In the meantime, this can simply be regarded as Unicode-conforming
behavior.
But a minor issue came to mind:

I think the name `isdigit' is quite similar to its ISO C
equivalent, but the two don't behave the same: ISO C and POSIX
SUSv3 specify that isdigit() is true only for 0 1 2 3 4 5 6 7 8
9.  So C's isdigit() doesn't return true for any
Unicode character > ord('9').  I just fear that the
inconsistency might cause some confusion.



----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-03-15 19:42

Message:
Logged In: YES 
user_id=38388

Python is following the Unicode standard in this respect.

If you want to make sure that only a subset of numbers is
parsed, I'd suggest that you write a little helper function
that implements the RE check and then lets int() do its work.
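
One possible shape for such a helper, as a Python 3 sketch (an
assumption; the thread concerns Python 2.4). The name parse_ascii_int
and the choice to allow surrounding whitespace and an optional sign are
illustrative, not something specified in this thread:

```python
import re

# Optional sign followed by one or more ASCII digits.
_ASCII_INT = re.compile(r"[+-]?[0-9]+")


def parse_ascii_int(text):
    """Parse text as an integer, accepting ASCII digits only.

    Raises ValueError for anything else, including strings made of
    non-ASCII Unicode digits that int() alone would accept.
    """
    stripped = text.strip()
    if _ASCII_INT.fullmatch(stripped) is None:
        raise ValueError("not an ASCII decimal integer: %r" % (text,))
    # The RE check passed, so int() only ever sees ASCII input here.
    return int(stripped)


if __name__ == "__main__":
    print(parse_ascii_int("56789012"))
    try:
        parse_ascii_int("\u0665\u0666\u0667")
    except ValueError as exc:
        print("rejected:", exc)
```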

Rejecting as "invalid".


----------------------------------------------------------------------
