Steve Bergman wrote:
> Thanks, all. Yes, Levenshtein seems to be the magic word I was looking
> for. (It's blazingly fast, too.)
>
> I suspect that if I strip out all the punctuation, etc. from both the
> itemnumber and description columns, as suggested, and concatenate them,
> pairing the record with its closest match in the other file, it ought
> to be pretty accurate. Obviously, the final decision will be up to a
> human being, but this should help them quite a bit.
>
> BTW, excluding all the items that match exactly, I only have 8000 items
> in one file to compare to 2600 in the other. As fast as
> python-levenshtein seems to be, this should finish in well under a
> minute.
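For reference, the plan quoted above might look roughly like the sketch below. The names normalise, levenshtein and closest_match are made up for illustration, the records are assumed to be (itemnumber, description) tuples, and the pure-Python edit distance is only a stand-in so the sketch runs without extra packages; python-levenshtein would do the same job far faster.

```python
import re

def normalise(item_number, description):
    """Lower-case, drop everything except letters and digits, concatenate."""
    return re.sub(r"[^a-z0-9]", "", (item_number + description).lower())

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (stand-in, not the C library)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def closest_match(record, candidates):
    """Return (distance, candidate) for the candidate nearest to record."""
    key = normalise(*record)
    scored = ((levenshtein(key, normalise(*c)), c) for c in candidates)
    return min(scored, key=lambda pair: pair[0])
```

With 8000 records against 2600 that is about 20.8 million comparisons, so the pure-Python distance above is for illustration only.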
The above suggests that you plan to do a preliminary pass using exact
comparison, and remove exact-matching pairs from further consideration.
If that is the case, here are a few questions for you to ponder:

What about 789o123 in file A and 789o123 in file B? Are you concerned
about standardising your item-numbers?

What about cases like 7890123 and 789o123 in file A? Are you concerned
about duplicated records within a file?

What about cases like 7890123 and 789o123 in file A and 7890123 and
789o123 and 78-901-23 in file B? Are you concerned about grouping all
instances of the same item?

If you are, the magic phrase you are looking for is "union find".

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
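P.S. In case "union find" is unfamiliar, here is a minimal sketch of the idea, not a tuned implementation. The class name UnionFind and its methods are my own choices for illustration, and it assumes each record is identified by a hashable key such as a (file, itemnumber) tuple. You call union() for every pair that you (or the human reviewer) decide is the same item, and groups() then hands back the clusters.

```python
class UnionFind:
    """Disjoint-set structure over arbitrary hashable keys (illustrative)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the representative of x's group, adding x if it is new."""
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk straight at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Record that a and b refer to the same item."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def groups(self):
        """Return a list of groups, each a list of keys."""
        out = {}
        for x in list(self.parent):
            out.setdefault(self.find(x), []).append(x)
        return list(out.values())
```

For the last case above you would union ("A", "7890123") with ("A", "789o123"), then each of those with its counterparts in file B, and all five records end up in a single group no matter in which order the pairs were found.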