Long story short, I'm trying to find all ISBN-10 numbers in a multiline string (approximately 10 pages of a normal book), and as far as I can tell, the *correct* thing to match would be this: ".*\D*(\d{10}|\d{9}X)\D*.*"
(It should be noted that I've removed all '-'s from the string, because they have a tendency to be mixed into ISBNs.)

However, on my 3200+ amd64, running the following:

    import re

    reISBN10 = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")
    isbn10s = reISBN10.findall(contents)

(where contents is the string) takes about 14 minutes, and there are only one or two matches.

If I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk losing results, but it runs in about 0.3 seconds.

So what's the deal? Why would it take so long to run the correct one, especially when a slight modification makes it run as fast as I'd expected from the beginning?

I'm sorry I cannot supply test data; in my case it comes from copyrighted material. However, if it proves needed, I can probably construct dummy data to illustrate the problem.

Any and all guidance would be greatly appreciated.

Kind regards,
Christian Sonne

PS: be gentle - it's my first post here :-)
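In case it helps, the two patterns above could be compared on dummy data with a minimal sketch like the one below. The filler text, the two ISBN-like numbers, and the helper names make_dummy and timed_findall are all made up for illustration. None of it comes from the real material, and it is not guaranteed to reproduce the 14-minute run.

```python
import re
import time

# The two patterns from this post, written as raw strings.
SLOW = re.compile(r".*\D*(\d{10}|\d{9}X)\D*.*")
FAST = re.compile(r".*[ ]*(\d{10}|\d{9}X)[ ]*.*")


def make_dummy(lines=50):
    """Build made-up multiline text (hypothetical stand-in for the real pages).

    Each line is filler words plus a short page number; two lines are then
    replaced with fake ISBN-like numbers. Use lines >= 2.
    """
    rows = ["lorem ipsum dolor sit amet page %d" % n for n in range(lines)]
    rows[lines // 2] = "ISBN 0306406152 some title"
    rows[-1] = "another one 123456789X end"
    return "\n".join(rows)


def timed_findall(pattern, text):
    """Run pattern.findall(text) and return (matches, elapsed_seconds)."""
    start = time.perf_counter()
    found = pattern.findall(text)
    return found, time.perf_counter() - start
```

Calling timed_findall(SLOW, make_dummy()) and timed_findall(FAST, make_dummy()) gives the matches and the elapsed time for each pattern. Raising the lines argument gradually (rather than jumping straight to ten pages' worth) seems the safer way to see how the running time grows.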