Jon Smirl wrote:
> I only have a passing acquaintance with Python and I need to modify some
> existing code. This code is going to get called with 10GB of data so it
> needs to be fairly fast.
>
> http://cvs2svn.tigris.org/ is code for converting a CVS repository to
> Subversion. I'm working on changing it to convert from CVS to git.
>
> The existing Python RCS parser provides me with the CVS deltas as
> strings. I need to get these deltas into an array of lines so that I can
> apply the diff commands that add/delete lines (like 10 d20, etc). What is
> the most efficient way to do this? The data structure needs to be
> able to apply the diffs efficiently too.
>
> The strings have embedded @'s doubled as an escape sequence, is there an
> efficient way to convert these back to single @'s?
>
> After each diff is applied I need to convert the array of lines back into
> a string, generate a sha-1 over it and then compress it with zlib and
> finally write it to disk.
>
> The 10GB of data is Mozilla CVS when fully expanded.
>
> Thanks for any tips on how to do this.
>
> Jon Smirl
> [EMAIL PROTECTED]

Splitting a string into a list (array) of lines is easy enough. If you
want to discard the line endings,

    lines = s.splitlines()

or, if you want to keep them,

    lines = s.splitlines(True)

Replacing substrings in a string is also easy:

    s = s.replace('@@', '@')

For efficiency, you'll probably want to do the replacement first, then
split:

    lines = s.replace('@@', '@').splitlines()

Once you've got your list of lines, Python's awesome list manipulation
should make applying diffs very easy. For instance, to replace lines 3
to 7 (counting from zero), assign a list of replacement lines to a
"slice" of the list of lines:

    lines[3:8] = replacement_lines

The replacement doesn't have to be the same length as the slice, so the
same kind of assignment covers adding and deleting lines. There's a lot
more to this; read up on Python's lists.

To convert the list back into one string, use the join() method. If you
kept the line endings,

    s = "".join(lines)

or, if you threw them away,

    s = "\n".join(lines)

Python has standard modules for the sha-1 digest (sha) and for zlib
compression (zlib). See http://docs.python.org/lib/lib.html

HTH, enjoy,
~Simon
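
Putting the steps above together, here is a minimal sketch rather than a
finished converter. It assumes Python 3, where hashlib.sha1 stands in for
the sha module, and it assumes the deltas are handled as bytes so that
nothing has to be decoded. The helper names (to_lines, delete_lines,
insert_lines, finish) are made up for illustration, and the 1-based line
numbers are an assumption about how the diff commands count.

```python
import hashlib
import zlib


def to_lines(raw):
    """Undo the doubled-@ escaping, then split, keeping the line endings."""
    return raw.replace(b"@@", b"@").splitlines(True)


def delete_lines(lines, start, count):
    """Delete `count` lines beginning at 1-based line `start`, in place."""
    del lines[start - 1:start - 1 + count]


def insert_lines(lines, after, new_lines):
    """Insert `new_lines` after 1-based line `after` (0 means at the top)."""
    lines[after:after] = new_lines


def finish(lines):
    """Join the lines; return (sha-1 hex digest, zlib-compressed bytes)."""
    data = b"".join(lines)
    return hashlib.sha1(data).hexdigest(), zlib.compress(data)


# Tiny made-up input, only to show the calls in order.
lines = to_lines(b"one\nuser@@example.org\nthree\nfour\n")
delete_lines(lines, 3, 2)          # drop "three" and "four"
insert_lines(lines, 2, [b"3\n"])   # add a new line after line 2
digest, packed = finish(lines)
```

Keeping the line endings when splitting means the join at the end is a
plain concatenation, so the text that gets hashed is exactly what the
edits produced.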