On Jun 1, 1:49 am, Peter Otten <[EMAIL PROTECTED]> wrote:
> Peter Otten wrote:
> > #untested
>
> Already found two major blunders :(
>
> # still untested
> import difflib
>
> def _merge(a, b):
>     # walk the diff opcodes, yielding runs from both lists in order
>     sm = difflib.SequenceMatcher(None, a, b)
>     for op, a1, a2, b1, b2 in sm.get_opcodes():
>         if op == "insert":
>             yield b[b1:b2]
>         elif op == "replace":
>             yield a[a1:a2]
>             yield b[b1:b2]
>         else: # delete, equal
>             yield a[a1:a2]
>
> def merge(a, b):
>     return sum(_merge(a, b), [])  # flatten the runs into a single list
>
> def merge_to_unique(sources):
>     return unique(reduce(merge, sorted(sources, key=len, reverse=True)))
>

difflib.SequenceMatcher looks promising; I'll try it. Thanks!
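
For anyone following along, here's a toy example (mine, not from Peter's
post) of what get_opcodes() reports; the merge above just stitches these
slices back together in order:

import difflib

a = ["apple", "banana", "cherry", "date"]
b = ["apple", "blueberry", "cherry", "elderberry"]

sm = difflib.SequenceMatcher(None, a, b)
for op, a1, a2, b1, b2 in sm.get_opcodes():
    print("%-8s %r %r" % (op, a[a1:a2], b[b1:b2]))

# equal    ['apple'] ['apple']
# replace  ['banana'] ['blueberry']
# equal    ['cherry'] ['cherry']
# replace  ['date'] ['elderberry']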


> def unique(items):
>     # keep the first occurrence of each item; fast path when there
>     # are no duplicates at all
>     u = set(items)
>     if len(u) == len(items):
>         return items
>     result = []
>     for item in items:
>         if item in u:
>             result.append(item)
>             u.remove(item)
>     return result

You did right by preserving the original (non-alphabetical) ordering,
but I'm less enthusiastic about the shape of this function. My
original function was 7 lines of code, with only 1 for the unique()
step. This version juggles up to three containers (the input list, the
set, and the result list). Is it really an improvement?
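
For comparison, here's the shorter shape I had in mind (an untested
sketch, like everything else in this thread):

def unique(items):
    # keep the first occurrence of each item; set.add() returns None,
    # so "not seen.add(item)" always passes and just records the item
    seen = set()
    return [item for item in items if item not in seen and not seen.add(item)]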

(Secret: the reference list (or any of the sources) is unlikely to be
more than a few dozen elements long. The data set that puts
merge_to_unique through a workout will be a giant list of
comparatively short lists, so the unique() part just needs to be short
and conceptually clean, while merge() should attempt sane behavior for
large len(sources).)
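
Something along these lines is the kind of workload I mean
(hypothetical data, just to sketch the shape; note that reduce() needs
a functools import on Python 3):

# one modest reference list plus hundreds of short lists drawn from it
reference = list("abcdefghij")
sources = [reference] + [reference[i:i + 3] for i in range(0, 10, 2)] * 100
print(merge_to_unique(sources))
# -> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']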