etal wrote: > Here's an algorithm question: How should I efficiently merge a > collection of mostly similar lists, with different lengths and > arbitrary contents, while eliminating duplicates and preserving order > as much as possible? > > My code: > > def merge_to_unique(sources): > """Merge the unique elements from each list in sources into new > list. > > Using the longest input list as a reference, merges in the > elements from > each of the smaller or equal-length lists, and removes duplicates. > > @return: Combined list of elements. > """ > sources.sort(None, len, True) # Descending length > ref = sources[0] > for src in sources[1:]: > for i, s in enumerate(src): > if s and (ref[i] != s) and s not in ref: > ref.insert(ref.index(src[i-1])+1, s) > # Remove duplicates > return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]] > > > This comes up with using the CSV module's DictWriter class to merge a > set (list, here) of not-quite-perfect CSV sources. The DictWriter > constructor needs a list of field names so that it can convert > dictionaries into rows of the CSV file it writes. Some of the input > CSV files are missing columns, some might have extras -- all of this > should be accepted, and the order of the columns in the merged file > should match the order of the input files as much as possible (not > alphabetical). All of the list elements are strings, in this case, but > it would be nice if the function didn't require it. > > Speed actually isn't a problem yet; it might matter some day, but for > now it's just an issue of conceptual aesthetics. Any suggestions?
#untested import difflib def _merge(a, b): sm = difflib.SequenceMatcher(None, a, b) for op, a1, a2, b1, b2 in sm.get_opcodes(): if op == "insert": yield b[b1:b2] else: yield a[a1:a2] def merge(a, b): return sum(_merge(a, b), []) def merge_to_unique(sources): return reduce(merge, sorted(sources, key=len, reverse=True)) Peter -- http://mail.python.org/mailman/listinfo/python-list