In <[EMAIL PROTECTED]>, Christian Sonne wrote:

> Long story short, I'm trying to find all ISBN-10 numbers in a multiline
> string (approximately 10 pages of a normal book), and as far as I can
> tell, the *correct* thing to match would be this:
> ".*\D*(\d{10}|\d{9}X)\D*.*"
>
> (it should be noted that I've removed all '-'s in the string, because
> they have a tendency to be mixed into ISBN's)
>
> however, on my 3200+ amd64, running the following:
>
> reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*")
> isbn10s = reISBN10.findall(contents)
>
> (where contents is the string)
>
> this takes about 14 minutes - and there are only one or two matches...

First of all, try to get rid of the '.*' at both ends of the regexp.
Don't let the re engine search for characters that you are not
interested in anyway.

Then leave off the '*' after each '\D'. It doesn't matter whether there
are several non-digits before or after the ISBN; there just has to be at
least one. BTW, with the star the pattern even matches when there is
*no* non-digit at all!

So the re looks like this:

    '\D(\d{10}|\d{9}X)\D'

Ciao,
	Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list
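
A minimal sketch of the simplified pattern in use, for anyone who wants to try it. The sample text and the names ISBN10_SIMPLE and ISBN10_LOOKAROUND are made up for illustration. Note that requiring a real '\D' on each side means a number at the very start or end of the string is not found; the second pattern is a variation that is not part of the advice above and uses lookarounds to assert "no digit here" without consuming a character.

```python
import re

# The simplified pattern from the reply: exactly one non-digit on each side.
ISBN10_SIMPLE = re.compile(r'\D(\d{10}|\d{9}X)\D')

# Hypothetical variant (an assumption, not from the original post):
# lookbehind/lookahead check for "not a digit" without consuming text,
# so a candidate at the start or end of the string can still match.
ISBN10_LOOKAROUND = re.compile(r'(?<!\d)(\d{10}|\d{9}X)(?!\d)')

# Made-up sample text, with '-' already removed as in the question.
contents = ("0306406152 is at the start, 080442957X in the middle, "
            "12345678901 is too long.")

print(ISBN10_SIMPLE.findall(contents))      # ['080442957X']
print(ISBN10_LOOKAROUND.findall(contents))  # ['0306406152', '080442957X']
```

Both patterns skip the 11-digit run, since a digit follows the first ten digits. Neither one checks the ISBN-10 checksum; they only find candidates.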