[EMAIL PROTECTED] wrote:
> hi
> i wrote some code to compare 2 files. One is the base file, the other
> file i got from somewhere. I need to compare this file against the
> base,
> eg base file
> abc
> def
> ghi
>
> eg another file
> abc
> def
> ghi
> jkl
>
> after compare , the base file will be overwritten with "jkl". Also both
> files tend to grow towards > 20MB ..
>
> Here is my code...using difflib.
>
> pat = re.compile(r'^\+') ## i want to get rid of the '+' from the difflib output...
> def difference(filename,basename):
>     import difflib
>     base = open(basename)
>     a = base.readlines()
>     input = open(filename)
>     b = input.readlines()
>     d = difflib.Differ()
>     diff = list(d.compare(a, b))
>     if len(diff) > 0:
>         os.remove(basename)
>         o = open(basename, "aU")
>         for i in diff:
>             if pat.search(i):
>                 i = i.lstrip("\+ ")
>                 o.writelines(i) ## write a new base file...
>         o.close()
>     g = open(basename)
>     return g.readlines()
>
> Whenever the 2 files get very large, i find that it's very slow
> comparing...any good advice to speed things up.? I thought of removing
> readlines() method, and use line by line compare. Is it a better way?
> thanks

It seems like you want a new base that contains only those lines of
'filename' that are not in 'basename', where 'basename' is an ordered
subset of 'filename'. In other words, 'filename' has all of the lines of
'basename', in order, but also has some additional lines. Is that correct?

If so, difflib looks to be overkill. Here is a suggestion:

    basefile = open(basename)
    newfile = open(filename)
    newbase = open('tmp.txt', 'w')

    # Both files are read lazily, one line at a time. The inner loop
    # resumes where it last stopped because newfile is a single iterator.
    for baseline in basefile:
        for newline in newfile:
            if baseline != newline:
                newbase.write(newline)   # line is not in the base
            else:
                break                    # matched; go to the next base line

    # Anything left in newfile after the last base line is also new.
    newbase.writelines(newfile)

    for afile in (basefile, newfile, newbase):
        afile.close()

If 'basename' is not an ordered subset of 'filename', then difflib seems
to be your best bet, because the general problem is computationally
intensive.

James

-- 
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
-- 
http://mail.python.org/mailman/listinfo/python-list
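
For the case where 'basename' is not an ordered subset of 'filename', here is a minimal sketch of one way to stay with difflib while avoiding the '+' stripping. It is not the code from the original post: it swaps difflib.Differ for difflib.SequenceMatcher and reads the added lines straight from the opcodes, which skips the extra character-level comparison that Differ performs on similar lines. The function name added_lines is hypothetical, and both line lists are still assumed to fit in memory, so whether it is fast enough for 20MB files would need to be measured.

```python
import difflib


def added_lines(base_lines, new_lines):
    """Return the lines of new_lines that the diff marks as inserted or replaced."""
    # autojunk=False disables the heuristic that treats very frequent
    # lines as junk once a sequence has 200 or more items.
    matcher = difflib.SequenceMatcher(None, base_lines, new_lines, autojunk=False)
    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' are the opcodes that carry lines
        # present in new_lines but not matched in base_lines.
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return added


if __name__ == "__main__":
    base = ["abc\n", "def\n", "ghi\n"]
    other = ["abc\n", "def\n", "ghi\n", "jkl\n"]
    print(added_lines(base, other))  # prints ['jkl\n']
```

The returned list can be passed to writelines() on the new base file, as in the suggestion above.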