Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)

Philip Martin Wed, 05 Jan 2011 05:18:08 -0800

Johan Corveleyn <jcor...@gmail.com> writes:

> Another question: a shell script might not be good, because not
> portable (and not fast)? Should I use python for this? Maybe the
> "write line by line with a line number in a for loop" would be a lot
> faster in Python? I don't know a lot of python, but it might be a good
> opportunity to learn some ...


A shell script is probably fine.  What I want is some data that I can
use on my machine to test your patches.

Here's a crude python script.  With the default values it generates two
4.3MB files in less than 2 seconds on my machine.  Subversion diff takes
over 10 seconds to compare the files, GNU diff less than one second.

Using --num-prefix=2 makes the script slightly slower, since it generates
more random numbers, and the time to run Subversion diff on the output
goes up to 2min.  GNU diff still takes a fraction of a second, and with
--minimal the time is 35s.  So for big improvements you probably want to
concentrate on shortcut heuristics, rather than low-level optimisation.

#!/usr/bin/python

import random, sys
from optparse import OptionParser

random.seed('abc') # repeatable

def write_file_contents(f, num_lines, num_prefix, num_suffix,
                        percent_middle, unique):
  for i in range(num_lines):
    if num_prefix > 1:
      prefix = random.randint(1, num_prefix)
    else:
      prefix = 1
    line = str(prefix) + "-common-prefix-" + str(prefix)

    middle = random.randint(1, 100)
    if middle <= percent_middle:
      line += " " + str(12345678 + i) + " "
    else:
      line += " " + str(9999999999 + i) + unique + " "

    if num_suffix > 1:
      suffix = random.randint(1, num_suffix)
    else:
      suffix = 1
    line += str(suffix) + "-common-suffix-" + str(suffix)
    f.write(line + '\n')


parser = OptionParser('Generate files for diff')
parser.add_option('--num-lines', type=int, default=100000, dest='num_lines',
                  help='number of lines, default 100000')
parser.add_option('--num-prefix', type=int, default=1, dest='num_prefix',
                  help='number of distinct prefixes, default 1')
parser.add_option('--num-suffix', type=int, default=1, dest='num_suffix',
                  help='number of distinct suffixes, default 1')
parser.add_option('--percent-middle', type=int, default=99,
                  dest='percent_middle',
                  help='percentage matching middles, default 99')
(options, args) = parser.parse_args()  # defaults to sys.argv[1:]

# 'with' closes (and flushes) each file before the script exits
with open('file1.txt', 'w') as f1:
  write_file_contents(f1, options.num_lines,
                      options.num_prefix, options.num_suffix,
                      options.percent_middle, 'a')

with open('file2.txt', 'w') as f2:
  write_file_contents(f2, options.num_lines,
                      options.num_prefix, options.num_suffix,
                      options.percent_middle, 'b')
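
To reproduce timings like the ones quoted above, something along these
lines could be used.  It is only a sketch: the helper name time_command
and the script name time_diff.py are made up for illustration, it
assumes Python 3, and the command to time is taken from the command
line, so it makes no assumption about how any particular diff
implementation is invoked.

```python
#!/usr/bin/env python3
# Hypothetical helper (not part of the generator script above): time an
# arbitrary command, e.g. a diff run over file1.txt and file2.txt.

import subprocess
import sys
import time


def time_command(argv):
    """Return the wall-clock seconds taken to run argv, discarding output."""
    start = time.perf_counter()
    # diff tools normally exit with status 1 when the files differ, so the
    # exit status is deliberately not treated as an error here.
    subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: time_diff.py COMMAND [ARG...]")
    elapsed = time_command(sys.argv[1:])
    print("%.2fs  %s" % (elapsed, " ".join(sys.argv[1:])))
```

For example, assuming GNU diff is on the PATH, it could be run as
"python3 time_diff.py diff --minimal file1.txt file2.txt" after
generating the two files.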
-- 
Philip
