John Machin wrote: > On Feb 25, 7:21 pm, Christian Sonne <[EMAIL PROTECTED]> wrote: >> Long story short, I'm trying to find all ISBN-10 numbers in a multiline >> string (approximately 10 pages of a normal book), and as far as I can >> tell, the *correct* thing to match would be this: >> ".*\D*(\d{10}|\d{9}X)\D*.*" > > All of those *s are making it work too hard. > > Starting with your r".*\D*(\d{10}|\d{9}X)\D*.*" [you do have the > r"..." not just "....", don't you?] > > Step 1: Lose the .* off each end -- this is meaningless in the context > of a search() or findall() and would slow the re engine down if it > doesn't optimise it away. > > r"\D*(\d{10}|\d{9}X)\D*" > > Step 2: I presume that the \D* at each (remaining) end is to ensure > that you don't pick up a number with 11 or more digits. You only need > to test that your presumed ISBN is not preceded/followed by ONE > suspect character. Is ABC1234567890DEF OK? I think not; I'd use \b > instead of \D > > r"\b(\d{10}|\d{9}X)\b" > > Step 3: Now that we have only \b (which matches 0 characters) at each > end of the ISBN, we can lose the capturing () > > r"\b\d{10}|\d{9}X\b" > > Step 4: In the case of 123456789X, it fails on the X and then scans > the 123456789 again -- a tiny waste compared to all the * stuff, but > still worth fixing. > > r"\b\d{9}[0-9X]\b" > > Give that a whirl and let us know how correct and how fast it is. > >> (it should be noted that I've removed all '-'s in the string, because >> they have a tendency to be mixed into ISBN's) >> >> however, on my 3200+ amd64, running the following: >> >> reISBN10 = re.compile(".*\D*(\d{10}|\d{9}X)\D*.*") > > You should really get into the habit of using raw strings with re. > >> isbn10s = reISBN10.findall(contents) >> >> (where contents is the string) >> >> this takes about 14 minutes - and there are only one or two matches... > > How many actual matches and how many expected matches? > > Note on "and there are only one or two matches": if your data > consisted only of valid ISBNs separated by a single space or comma, it > would run much faster. It is all the quadratic/exponential mucking > about with the in-between bits that slows it down. To demonstrate > this, try timing dummy data like "1234567890 " * 1000000 and > "123456789X " * 1000000 with your various regexes, and with the step1, > step2 etc regexes above. > >> if I change this to match ".*[ ]*(\d{10}|\d{9}X)[ ]*.*" instead, I risk >> loosing results, but it runs in about 0.3 seconds >> >> So what's the deal? - why would it take so long to run the correct one? > > Because you have .*\D* in other words 0 or more occurrences of almost > anything followed by 0 or more occurrences of almost anything. Even > assuming it ignores the .*, it will find the longest possible sequence > of non-digits, then try to match the ISBN stuff. If it fails, it will > shorten that sequence of non-digits, try the ISBN stuff again, etc etc > until it matches the ISBN stuff or that sequence of non-digits is down > to zero length. It will do that for each character position in the > file contents. Why is it faster when you change \D to []? Presumably > because in your data, sequences of non-digits are longer than > sequences of spaces IOW there is less stuff to back-track over. > > >> - especially when a slight modification makes it run as fast as I'd >> expect from the beginning... >> >> I'm sorry I cannot supply test data, in my case, it comes from >> copyrighted material - however if it proves needed, I can probably >> construct dummy data to illustrate the problem > > What you call "dummy data", I'd call "test data". You should make a > sample of "dummy data" and test that your regex is (a) finding all > ISBNs that it should (b) not reporting incorrect matches, *BEFORE* > being concerned with speed. > >> Any and all guidance would be greatly appreciated, > > For some light reading :-) borrow a copy of Friedl's book (mentioned > in the Python re docs) and read the parts on how backtracking regex > engines work. > > HTH, > John >
Thanks to all of you for your replies - they have been most helpful, and my program is now running at a reasonable pace... I ended up using r"\b\d{9}[0-9X]\b" which seems to do the trick - if it turns out to misbehave in further testing, I'll know where to turn :-P -- http://mail.python.org/mailman/listinfo/python-list