On 16/10/16 16:16, Seymore4Head wrote:
> How to pick out the same titles.
>
> I have a long text file that has movie titles in it and I would like
> to find dupes.
>
> The thing is that sometimes I have one called "The Killing Fields" and
> it also could be listed as "Killing Fields". Sometimes the title will
> have the date a year off.
>
> What I would like to do is output to another file that shows those two
> as a match.
>
> I don't know the best way to tackle this. I would think you would
> have to pair the titles with the most consecutive letters in a row.
>
> Anyone want this as a practice exercise? I don't really use
> programming enough to remember how.
Tokenize, generate (token) set similarity scores, and cluster on the
similarity score.

>>> import tokenization
>>> bigrams1 = tokenization.n_grams("The Killing Fields".lower(), 2, pad=True)
>>> bigrams1
['_t', 'th', 'he', 'e ', ' k', 'ki', 'il', 'll', 'li', 'in', 'ng', 'g ', ' f', 'fi', 'ie', 'el', 'ld', 'ds', 's_']
>>> bigrams2 = tokenization.n_grams("Killing Fields".lower(), 2, pad=True)
>>> import pseudo
>>> pseudo.Jaccard(bigrams1, bigrams2)
0.7

You could probably just generate the token sets, then iterate through all
title pairs and manually review those with similarity scores above a
suitable threshold (a rough sketch of that pairing loop is appended at the
end of this message).

The code I used above is very simple (and pasted below).

def n_grams(s, n, pad=False):
    # n >= 1
    # Returns a list of n-grams,
    # or an empty list if n > len(s).
    if pad:
        s = '_' * (n - 1) + s + '_' * (n - 1)
    return [s[i:i+n] for i in range(len(s) - n + 1)]

def Jaccard(tokens1, tokens2):
    # Returns the exact Jaccard similarity
    # measure for two token sets.
    tokens1 = set(tokens1)
    tokens2 = set(tokens2)
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)

Duncan
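
For illustration, here is one way the pair-comparison step might look,
assuming the n_grams and Jaccard functions above are defined in the same
file. The filenames 'titles.txt' and 'dupes.txt', the 0.6 threshold, and
the find_likely_dupes name are placeholders for this sketch, not anything
from the original post.

from itertools import combinations

def find_likely_dupes(titles, threshold=0.6):
    # Pre-compute a padded-bigram set per title, then score every pair
    # with the Jaccard measure and keep pairs above the threshold.
    # The 0.6 default is just an example cut-off to tune on real data.
    token_sets = {t: set(n_grams(t.lower(), 2, pad=True)) for t in titles}
    matches = []
    for t1, t2 in combinations(titles, 2):
        score = Jaccard(token_sets[t1], token_sets[t2])
        if score >= threshold:
            matches.append((score, t1, t2))
    # Highest-scoring (most likely duplicate) pairs first.
    return sorted(matches, reverse=True)

if __name__ == '__main__':
    # Read one title per line, then write the candidate matches out
    # for manual review, as the original poster asked.
    with open('titles.txt') as f:
        titles = [line.strip() for line in f if line.strip()]
    with open('dupes.txt', 'w') as out:
        for score, t1, t2 in find_likely_dupes(titles):
            out.write('%.2f\t%s\t%s\n' % (score, t1, t2))

Given that "The Killing Fields" versus "Killing Fields" scores 0.7 in the
session above, a cut-off somewhere around 0.6 would surface that pair; the
right threshold is something to tune against the actual title list.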