hayes.ty...@gmail.com wrote:

My first thought is to do a sweep, where the first sweep takes one
line from f1, travels f2, if found, deletes it from a tmp version of
f2, and then on to the second line, and so on. If not found, it writes
to a file. At the end, if there are also lines still in f1 that never
were matched because it was longer, it appends those as well to the
difference file. At the end, you have a nice summary of the lines
(i.e., records) which are not found in both files.

Any suggestions where to start?


You can adapt this, provided the files are already sorted. Memory usage scales linearly with the size of the difference between the files, and running time scales linearly with the file sizes.


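#!/usr/bin/env python3

import sys


def run(fname_a, fname_b):
    # Lines seen so far in only one of the two files.
    a_lines = set()
    b_lines = set()

    with open(fname_a) as filea, open(fname_b) as fileb:
        while True:
            # readline() returns '' once a file is exhausted.
            a = filea.readline()
            b = fileb.readline()
            if not (a or b):
                break

            if a == b:
                continue

            # A line already seen in the other file is matched; drop it.
            if a in b_lines:
                b_lines.remove(a)
            elif a:
                a_lines.add(a)

            if b in a_lines:
                a_lines.remove(b)
            elif b:
                b_lines.add(b)

    # Lines keep their trailing newline, so write them out unchanged.
    for line in sorted(a_lines):
        sys.stdout.write(line)

    if a_lines or b_lines:
        print('')
        print('***************')
        print('')

    for line in sorted(b_lines):
        sys.stdout.write(line)


if __name__ == '__main__':
    run(sys.argv[1], sys.argv[2])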
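

Since the files are assumed to be sorted, a merge-style comparison is another option. What follows is only a sketch and not the script above: sorted_diff is a made-up name, it is written for Python 3, and it assumes both inputs are sorted in the order Python uses when comparing strings. It holds one line from each input at a time plus the collected differences, and because it does not use sets, repeated lines are counted separately.

```python
import io


def sorted_diff(lines_a, lines_b):
    """Return (only_in_a, only_in_b) for two sorted iterables of lines.

    Hypothetical helper for illustration. Assumes both inputs are
    sorted ascending by Python's default str comparison.
    """
    only_a, only_b = [], []
    iter_a, iter_b = iter(lines_a), iter(lines_b)
    a = next(iter_a, None)
    b = next(iter_b, None)

    while a is not None and b is not None:
        if a == b:
            # Same line on both sides: consume one from each input.
            a = next(iter_a, None)
            b = next(iter_b, None)
        elif a < b:
            # a sorts before b, so the second input cannot contain a.
            only_a.append(a)
            a = next(iter_a, None)
        else:
            # b sorts before a, so the first input cannot contain b.
            only_b.append(b)
            b = next(iter_b, None)

    # Whatever remains on either side has no partner.
    while a is not None:
        only_a.append(a)
        a = next(iter_a, None)
    while b is not None:
        only_b.append(b)
        b = next(iter_b, None)

    return only_a, only_b


if __name__ == '__main__':
    # In-memory stand-ins for two sorted files.
    f1 = io.StringIO('apple\nbanana\ncherry\n')
    f2 = io.StringIO('banana\ncherry\ndate\n')
    only_f1, only_f2 = sorted_diff(f1, f2)
    print(only_f1)  # ['apple\n']
    print(only_f2)  # ['date\n']
```

Open file objects can be passed in directly, since they iterate line by line. If the files are not sorted, this sketch gives wrong answers, so sort them first or stay with the set-based script.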

--
http://mail.python.org/mailman/listinfo/python-list