Re: Fuzzy string comparison

2006-12-28 Thread jmw
ion, you mentioned the data you need to run comparisons on is stored in a database. Is this string comparison a one-time processing kind of thing to clean up the data, or are you going to have to continually do fuzzy string comparison on the data in the database? There are some papers out ther

Re: Fuzzy string comparison

2006-12-27 Thread Gabriel Genellina
At Wednesday 27/12/2006 18:59, John Machin wrote: > Thanks, all. Yes, Levenshtein seems to be the magic word I was looking > for. (It's blazingly fast, too.) In case you need something more, this article is a good starting point: Record Linkage: A Machine Learning Approach, A Toolbox, and A D

Re: Fuzzy string comparison

2006-12-27 Thread John Machin
Steve Bergman wrote: > Thanks, all. Yes, Levenshtein seems to be the magic word I was looking > for. (It's blazingly fast, too.) > > I suspect that if I strip out all the punctuation, etc. from both the > itemnumber and description columns, as suggested, and concatenate them, > pairing the record

Re: Fuzzy string comparison

2006-12-27 Thread Steve Bergman
Thanks, all. Yes, Levenshtein seems to be the magic word I was looking for. (It's blazingly fast, too.) I suspect that if I strip out all the punctuation, etc. from both the itemnumber and description columns, as suggested, and concatenate them, pairing the record with its closest match in the ot
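The plan described above (strip punctuation from both columns, concatenate them, then pair each record with its closest match in the other file) might look roughly like the sketch below. Everything named here is an assumption made for illustration: records are taken to be `(itemnumber, description)` tuples, and `make_key` and `closest_match` are hypothetical helpers. To stay within the standard library, the similarity score comes from `difflib.SequenceMatcher` as a stand-in for the Levenshtein extension mentioned in the thread.

```python
import difflib
import re


def make_key(itemnumber, description):
    """Concatenate both columns and keep only lowercase letters and digits."""
    combined = (itemnumber + description).lower()
    return re.sub(r"[^0-9a-z]", "", combined)


def closest_match(record, others):
    """Return the record in `others` whose key is most similar to `record`'s key.

    Similarity is difflib's ratio (0.0 to 1.0), used here in place of a
    Levenshtein distance so that the sketch needs no third-party module.
    """
    key = make_key(*record)

    def score(other):
        return difflib.SequenceMatcher(None, key, make_key(*other)).ratio()

    return max(others, key=score)


# Hypothetical sample data: one record from file A, two candidates from file B.
record_a = ("00-1234", "Widget, blue")
file_b = [("1234", "Widget blue"), ("5678", "Gadget red")]
match = closest_match(record_a, file_b)
```

Comparing every record against every other record is quadratic, so for thousands of rows it may be worth pairing the exact key matches first and running the fuzzy pass only on the leftovers.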

Re: Fuzzy string comparison

2006-12-27 Thread Steven D'Aprano
On Wed, 27 Dec 2006 02:52:42 -0800, John Machin wrote: > > Duncan Booth wrote: >> "John Machin" <[EMAIL PROTECTED]> wrote: >> >> > To compare two strings, take copies, and: >> >> Taking a copy of a string seems kind of superfluous in Python. > > You are right, I really meant don't do: > orig

Re: Fuzzy string comparison

2006-12-27 Thread John Machin
Duncan Booth wrote: > "John Machin" <[EMAIL PROTECTED]> wrote: > > > To compare two strings, take copies, and: > > Taking a copy of a string seems kind of superfluous in Python. You are right, I really meant don't do: original = original.strip().replace().replace() (a strange way of d

Re: Fuzzy string comparison

2006-12-27 Thread Jorge Godoy
"Steve Bergman" <[EMAIL PROTECTED]> writes: > I'm looking for a module to do fuzzy comparison of strings. I have 2 > item master files which are supposed to be identical, but they have > thousands of records where the item numbers don't match in various > ways. One might include a '-' or have le

Re: Fuzzy string comparison

2006-12-27 Thread Duncan Booth
"John Machin" <[EMAIL PROTECTED]> wrote: > To compare two strings, take copies, and: Taking a copy of a string seems kind of superfluous in Python. -- http://mail.python.org/mailman/listinfo/python-list

Re: Fuzzy string comparison

2006-12-26 Thread John Machin
Carsten Haese wrote: > On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote: > > Wojciech Mula wrote: > > > Steve Bergman wrote: > > > > I'm looking for a module to do fuzzy comparison of strings. [...] > > > > > > Check module difflib, it returns difference between two sequences. > > > > and it's

Re: Fuzzy string comparison

2006-12-26 Thread Gabriel Genellina
At Tuesday 26/12/2006 18:08, John Machin wrote: Wojciech Mula wrote: > Steve Bergman wrote: > > I'm looking for a module to do fuzzy comparison of strings. [...] > > Check module difflib, it returns difference between two sequences. and it's intended for comparing text files, and is relatively

Re: Fuzzy string comparison

2006-12-26 Thread Carsten Haese
On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote: > Wojciech Mula wrote: > > Steve Bergman wrote: > > > I'm looking for a module to do fuzzy comparison of strings. [...] > > > > Check module difflib, it returns difference between two sequences. > > and it's intended for comparing text files, a

Re: Fuzzy string comparison

2006-12-26 Thread John Machin
Wojciech Mula wrote: > Steve Bergman wrote: > > I'm looking for a module to do fuzzy comparison of strings. [...] > > Check module difflib, it returns difference between two sequences. and it's intended for comparing text files, and is relatively slow. Google "python levenshtein". You'll probably
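For readers who want to see what a Levenshtein distance actually computes before installing anything, here is a minimal pure-Python sketch of the usual dynamic-programming formulation. It is not the C extension the search above would turn up, and it will be much slower than one; the function name `levenshtein` is just a label chosen for this example.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]
```

A distance of 1 covers exactly the "single character missing" case from the original question, for example `levenshtein("1234", "124")`.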

Re: Fuzzy string comparison

2006-12-26 Thread Wojciech Muła
Steve Bergman wrote: > I'm looking for a module to do fuzzy comparison of strings. [...] Check module difflib, it returns difference between two sequences.
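As a rough illustration of the difflib suggestion, the sketch below uses two standard-library calls: `difflib.SequenceMatcher(...).ratio()` for a similarity score between 0.0 and 1.0, and `difflib.get_close_matches()` to pick the best candidates from a list. The item numbers are invented sample data and `similarity` is a hypothetical wrapper.

```python
import difflib


def similarity(a, b):
    """Similarity of two strings as a float in [0.0, 1.0]; 1.0 means identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Invented item numbers from the "other" master file.
candidates = ["100234", "100243", "200234"]

# Ask for the single best candidate scoring at least 0.6.
best = difflib.get_close_matches("10234", candidates, n=1, cutoff=0.6)
```

`get_close_matches` returns a list that is empty when nothing reaches the cutoff, so the caller should check for that before indexing into it.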

Fuzzy string comparison

2006-12-26 Thread Steve Bergman
I'm looking for a module to do fuzzy comparison of strings. I have 2 item master files which are supposed to be identical, but they have thousands of records where the item numbers don't match in various ways. One might include a '-' or have leading zeros, or have a single character missing, or a
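Two of the mismatches listed above, stray '-' characters and leading zeros, can probably be removed by normalizing the item numbers before any fuzzy comparison is attempted. The following is a minimal sketch under assumptions: `normalize_item_number` is a hypothetical helper, and it assumes that case and punctuation carry no meaning in these item numbers, which would need checking against the real data.

```python
import re


def normalize_item_number(raw):
    """Reduce an item number to a canonical key for comparison.

    Assumes punctuation, spaces, letter case and leading zeros are not
    significant (an assumption about the data, not a known fact).
    """
    # Keep only letters and digits, so '12-34' and '12 34' agree.
    key = re.sub(r"[^0-9A-Za-z]", "", raw).upper()
    # Drop leading zeros, but keep one '0' if the number was all zeros.
    stripped = key.lstrip("0")
    if stripped or not key:
        return stripped
    return "0"
```

Records whose normalized keys are equal can be paired directly; a missing character survives normalization, so those cases still need a fuzzy comparison.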