On Sat, 2011-02-19 at 04:53 +0100, Stefan Sperling wrote:
> On Fri, Feb 18, 2011 at 09:19:56PM -0500, Greg Stein wrote:
> > Can somebody provide a pointer to some of the latest speed analysis?
> 
> Neels is on vacation this week. When he returns, I'll prod him
> about running his performance tests again and sharing the results.

* neels prodded

If my tests are going to be "official", I feel they need some
verification / opinions, and possibly an extension so that they test
more than just ra_local.

- I run a pseudo-randomized checkout-switch-modify-merge-resolve series
in ra_local only. This emphasizes the timings of libsvn_wc, so any
additional working-copy overhead shows up as a bad time factor. Example:
the test may spit out a time factor of 2 (twice as slow) even though
network communication is usually orders of magnitude slower, so 'real'
ra_* access would never notice such a bad factor.

- On the other hand, if trunk for some reason needed more ra_
connections than 1.6.x, we wouldn't see that, since ra_local access
timing is negligible.

(Maybe it would be better to talk about added seconds of run time
instead of factors.)
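
To illustrate with made-up numbers (nothing measured here): say the
wc-bound part of an operation takes 0.1s on 1.6 and 0.2s on trunk, and a
real ra_* layer would add about 2 seconds of network time on top of
either. Then:
[[[
# illustration only; all numbers below are invented, not measured
wc_16, wc_trunk = 0.1, 0.2   # hypothetical wc-bound times (seconds)
network = 2.0                # hypothetical network overhead of a real ra_* layer

print 'ra_local factor:  %.2f' % (wc_trunk / wc_16)                          # 2.00
print 'networked factor: %.2f' % ((wc_trunk + network) / (wc_16 + network))  # ~1.05
print 'added seconds:    %.2f' % (wc_trunk - wc_16)                          # 0.10
]]]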


Anyone else keen on forming an opinion on my humble tests? Let's break
it down.

I've got one Python script that can run N tests for a single svn build
in a specific dir-depth / dir-spread config, and it writes its results
into a Python pickle file.

The results add up the times that each subcommand takes to complete, by
name. E.g. all 'svn update' runs are added up.

Later runs can combine and compare pickle files and print stats.
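
(For reference, such a pickle file just holds a Timings instance from
the script below: essentially a dict mapping each subcommand name to the
list of wall-clock seconds it took. A toy illustration of that shape,
with invented numbers, and of what the printed stats boil down to:)
[[[
# toy illustration of the recorded data shape; the numbers are invented
timings = {
  'update': [0.41, 0.39, 0.44],   # one entry per 'svn update' invocation
  'commit': [0.95, 1.02],         # one entry per 'svn commit' invocation
}
for name, secs in timings.items():
  print '%5d %7.3f %7.3f %7.3f  %s' % (
      len(secs), min(secs), max(secs), sum(secs) / len(secs), name)
]]]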

A bash script calls a series of such svn-version/dir-depth/dir-spread
runs and finally compares the pickle files to print overall stats.

Roughly, these are the svn commands that get run; the calls below are
Python functions that invoke svn in the way their names suggest:
[[[
      run_cmd(['svnadmin', 'create', repos])
      svn('checkout', file_url, wc)

      trunk = j(wc, 'trunk')
      create_tree(trunk, levels, spread)
      add(trunk)
      st(wc)
      ci(wc)
      up(wc)
      propadd_tree(trunk, 0.5)
      ci(wc)
      up(wc)
      st(wc)

      trunk_url = file_url + '/trunk'
      branch_url = file_url + '/branch'

      svn('copy', '-mm', trunk_url, branch_url)
      st(wc)

      up(wc)
      st(wc)

      svn('checkout', trunk_url, wc2)
      st(wc2)
      modify_tree(wc2, 0.5)
      st(wc2)
      ci(wc2)
      up(wc2)
      up(wc)

      svn('switch', branch_url, wc2)
      modify_tree(wc2, 0.5)
      st(wc2)
      ci(wc2)
      up(wc2)
      up(wc)

      modify_tree(trunk, 0.5)
      st(wc)
      ci(wc)
      up(wc2)
      up(wc)

      svn('merge', '--accept=postpone', trunk_url, wc2)
      st(wc2)
      svn('resolve', '--accept=mine-conflict', wc2)
      st(wc2)
      svn('resolved', '-R', wc2)
      st(wc2)
      ci(wc2)
      up(wc2)
      up(wc)

      svn('merge', '--accept=postpone', '--reintegrate', branch_url, trunk)
      st(wc)
      svn('resolve', '--accept=mine-conflict', wc)
      st(wc)
      svn('resolved', '-R', wc)
      st(wc)
      ci(wc)
      up(wc2)
      up(wc)

      svn('delete', j(wc, 'branch'))
      ci(wc)
      up(wc2)
      up(wc)
]]]



Excerpts from the "outer layer" shell script:
[[[

batch(){
  levels="$1"
  spread="$2"
  N="$3"
  pre="${levels}x${spread}_"
  eval "$(pat bashrc)"
  pat use 1.6
  ./benchmark.py run ${pre}1.6_1.runs $levels $spread $N
  ./benchmark.py run ${pre}1.6_2.runs $levels $spread $N
  pat use 1.7
  ./benchmark.py run ${pre}1.7_1.runs $levels $spread $N
  ./benchmark.py run ${pre}1.7_2.runs $levels $spread $N

  <combine stats>
  <print stats>
}
]]]

This is a bash function that switches to svn 1.6 (using my humble helper
'pat' [1] to modify the PATH environment), runs the whole test 2*N
times, then switches to svn 1.7 and runs it another 2*N times. Each
build gets two separate series of N runs so that we can also compare two
identical runs and verify that their timing factors come out
sufficiently near 1.0.

Then that whole batch is run in three configurations (a: 4x4, b: 100x1,
c: 1x100), N times each; the first number says how deep the deepest dir
tree goes ("levels") and the second how many child dirs (and files) each
dir has ("spread").

We can very easily modify these few numbers to choose test run size from
tiny to "infinite".

[[[
N=3   
# run a: levels 4, spread 4  (4x4)
al=4  
as=4

# run b: levels 100, spread 1 (100x1)
bl=100
bs=1

# run c...
cl=1  
cs=100   

batch $al $as $N
batch $bl $bs $N
batch $cl $cs $N

<combine stats>
<print overall stats>
]]]
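
(For a rough sense of scale: create_tree() in the script below puts
<spread> files into each dir and, while more than one level remains,
<spread> child dirs, so the initial tree sizes for these three
configurations can be worked out exactly. A small sketch mirroring that
recursion:)
[[[
# mirrors create_tree()'s recursion to count the initial (unmodified) tree
def tree_size(levels, spread):
  dirs, files = 1, spread          # the dir itself plus its <spread> files
  if levels > 1:
    d, f = tree_size(levels - 1, spread)
    dirs += spread * d
    files += spread * f
  return dirs, files

for levels, spread in ((4, 4), (100, 1), (1, 100)):
  print '%dx%d -> %d dirs, %d files' % ((levels, spread)
                                        + tree_size(levels, spread))
# 4x4 -> 85 dirs, 340 files; 100x1 -> 100, 100; 1x100 -> 1, 100
]]]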


I'd be delighted if anyone else wants to hack on this stuff -- with or
without me.

~Neels


[1] I wrote pat for myself to take care of repetitive svn devel tasks. I
also use it to maintain several different svn builds alongside each
other, so it's rather large and unreviewed. In this test, pat is only
used to point the PATH variable at the 1.6 or the 1.7 build,
respectively. http://hofmeyr.de/code/pat/

#!/usr/bin/env python

"""
usage: benchmark.py run <run_file> <levels> <spread> [N]
       benchmark.py show <run_file> [<run_file> ...]
       benchmark.py compare <run_file1> <run_file2>
       benchmark.py combine <run_file1> <run_file2> <combined_file>

Test data is written to run_file.
If a run_file exists, data is added to it.
<levels> is the number of directory levels to create
<spread> is the number of child trees spreading off each dir level
If <N> is provided, the run is repeated N times.
"""

import os, sys, time
import tempfile

from datetime import datetime, timedelta
from subprocess import Popen, PIPE, call
import random
import shutil

import cPickle

VERBOSE = False

DEFAULT_TIMINGS_PATH = './benchmark_py_last_run.py-pickle'

timings = None

def run_cmd(cmd, stdin=None, shell=False):

  if shell:
    printable_cmd = 'CMD: ' + cmd
  else:
    printable_cmd = 'CMD: ' + ' '.join(cmd)
  if VERBOSE:
    print printable_cmd

  if stdin:
    stdin_arg = PIPE
  else:
    stdin_arg = None

  p = Popen(cmd, stdin=stdin_arg, stdout=PIPE, stderr=PIPE, shell=shell)
  stdout,stderr = p.communicate(input=stdin)

  if VERBOSE:
    if (stdout):
      print "STDOUT: [[[\n%s]]]" % ''.join(stdout)
  if (stderr):
    print "STDERR: [[[\n%s]]]" % ''.join(stderr)

  return stdout,stderr

def timedelta_to_seconds(td):
  return ( float(td.seconds)
           + float(td.microseconds) / (10**6)
           + td.days * 24 * 60 * 60 )


class Timings:

  def __init__(self):
    self.timings = {}
    self.current_name = None
    self.tic_at = None

  def tic(self, name):
    self.toc()
    self.current_name = name
    self.tic_at = datetime.now()

  def toc(self):
    if self.current_name and self.tic_at:
      toc_at = datetime.now()
      self.submit_timing(self.current_name, 
                         timedelta_to_seconds(toc_at - self.tic_at))
    self.current_name = None
    self.tic_at = None

  def submit_timing(self, name, seconds):
    times = self.timings.get(name)
    if not times:
      times = []
      self.timings[name] = times
    times.append(seconds)

  def summary(self):
    s = ['count   min     max     avg    operation  (unit is seconds)']
    for name, timings in self.timings.items():
      if not name or not timings: continue

      s.append('%5d %7.3f %7.3f %7.3f  %s' % (
                 len(timings),
                 min(timings),
                 max(timings),
                 sum(timings) / len(timings),
                 name))
    return '\n'.join(s)

  def compare_to(self, other):
    s = ['  min     max     avg    operation  (unit is factor between runs)']
    def do_div(a, b):
      if b:
        return float(a) / float(b)
      else:
        return 0.0

    for name, timings in self.timings.items():
      other_timings = other.timings.get(name)
      if not other_timings:
        continue
      s.append('%7.3f %7.3f %7.3f  %s' % (
                 do_div(min(timings), min(other_timings)),
                 do_div(max(timings), max(other_timings)),
                 do_div(sum(timings) / len(timings),
                        sum(other_timings) / len(other_timings)),
                 name))
    return '\n'.join(s)


  def add(self, other):
    for name, other_times in other.timings.items():
      my_times = self.timings.get(name)
      if not my_times:
        my_times = []
        self.timings[name] = my_times
      my_times.extend(other_times)




j = os.path.join

_create_count = 0

def next_name(prefix):
  global _create_count
  _create_count += 1
  return '_'.join((prefix, str(_create_count)))

def create_tree(in_dir, levels, spread=5):
  try:
    os.mkdir(in_dir)
  except OSError:
    # the directory may already exist
    pass

  for i in range(spread):
    # files
    fn = j(in_dir, next_name('file'))
    f = open(fn, 'w')
    f.write('This is %s\n' % fn)
    f.close()

    # dirs
    if (levels > 1):
      dn = j(in_dir, next_name('dir'))
      create_tree(dn, levels - 1, spread)


def svn(*args):
  global timings
  name = args[0]
  cmd = ['svn']
  cmd.extend(args)
  if VERBOSE:
    print 'svn cmd: ' + ' '.join(cmd)
 
  # no input is ever piped to the svn commands in this benchmark
  stdin = None
  stdin_arg = None

  timings.tic(name)
  try:
    p = Popen(cmd, stdin=stdin_arg, stdout=PIPE, stderr=PIPE, shell=False)
    stdout,stderr = p.communicate(input=stdin)
  finally:
    timings.toc()

  if VERBOSE:
    if (stdout):
      print "STDOUT: [[[\n%s]]]" % ''.join(stdout)
    if (stderr):
      print "STDERR: [[[\n%s]]]" % ''.join(stderr)

  return stdout,stderr


def add(*args):
  return svn('add', *args)

def ci(*args):
  return svn('commit', '-mm', *args)

def up(*args):
  return svn('update', *args)

def st(*args):
  return svn('status', *args)

_chars = [chr(x) for x in range(ord('a'), ord('z') +1)]

def randstr(length=8):
  return ''.join( [random.choice(_chars) for i in range(length)] )

def _copy(path):
  dest = next_name(path + '_copied')
  svn('copy', path, dest)

def _move(path):
  dest = path + '_moved'
  svn('move', path, dest)

def _propmod(path):
  so, se = svn('proplist', path)
  propnames = [line.strip() for line in so.strip().split('\n')[1:]]

  # modify?
  if len(propnames):
    svn('ps', propnames[len(propnames) / 2], randstr(), path)

  # del?
  if len(propnames) > 1:
    svn('propdel', propnames[len(propnames) / 2], path)


def _propadd(path):
  # set a new one.
  svn('propset', randstr(), randstr(), path)


def _mod(path):
  if os.path.isdir(path):
    return _propmod(path)

  f = open(path, 'a')
  f.write('\n%s\n' % randstr())
  f.close()

def _add(path):
  if os.path.isfile(path):
    return _mod(path)

  if random.choice((True, False)):
    # create a dir
    svn('mkdir', j(path, next_name('new_dir')))
  else:
    # create a file
    new_path = j(path, next_name('new_file'))
    f = open(new_path, 'w')
    f.write(randstr())
    f.close()
    svn('add', new_path)

def _del(path):
  svn('delete', path)

_mod_funcs = (_mod, _add, _propmod, _propadd, )  # disabled: _copy, _move, _del
  
def modify_tree(in_dir, fraction):
  child_names = os.listdir(in_dir)
  for child_name in child_names:
    if child_name[0] == '.':
      continue
    if random.random() < fraction:
      path = j(in_dir, child_name)
      random.choice(_mod_funcs)(path)

  for child_name in child_names:
    if child_name[0] == '.': continue
    path = j(in_dir, child_name)
    if os.path.isdir(path):
      modify_tree(path, fraction)
  
def propadd_tree(in_dir, fraction):
  for child_name in os.listdir(in_dir):
    if child_name[0] == '.': continue
    path = j(in_dir, child_name)
    if random.random() < fraction:
      _propadd(path)
    if os.path.isdir(path):
      propadd_tree(path, fraction)


def run(levels, spread):
  global timings

  # ensure identical modifications for every run of this script
  random.seed(0)

  base = tempfile.mkdtemp()
  try:
    repos = j(base, 'repos')
    wc = j(base, 'wc')
    wc2 = j(base, 'wc2')

    file_url = 'file://%s' % repos

    so, se = run_cmd(['which', 'svn'])
    if not so:
      print "Can't find svn."
      exit(1)

    print '\nRunning svn benchmark in', base
    print 'dir levels: %s; new files and dirs per leaf: %s' % (levels, spread)
    so, se = svn('--version')
    print ', '.join( so.split('\n')[:2] )
    started = datetime.now()

    try:
      run_cmd(['svnadmin', 'create', repos])
      svn('checkout', file_url, wc)

      trunk = j(wc, 'trunk')
      create_tree(trunk, levels, spread)
      add(trunk)
      st(wc)
      ci(wc)
      up(wc)
      propadd_tree(trunk, 0.5)
      ci(wc)
      up(wc)
      st(wc)

      trunk_url = file_url + '/trunk'
      branch_url = file_url + '/branch'

      svn('copy', '-mm', trunk_url, branch_url)
      st(wc)

      up(wc)
      st(wc)

      svn('checkout', trunk_url, wc2)
      st(wc2)
      modify_tree(wc2, 0.5)
      st(wc2)
      ci(wc2)
      up(wc2)
      up(wc)

      svn('switch', branch_url, wc2)
      modify_tree(wc2, 0.5)
      st(wc2)
      ci(wc2)
      up(wc2)
      up(wc)

      modify_tree(trunk, 0.5)
      st(wc)
      ci(wc)
      up(wc2)
      up(wc)

      svn('merge', '--accept=postpone', trunk_url, wc2)
      st(wc2)
      svn('resolve', '--accept=mine-conflict', wc2)
      st(wc2)
      svn('resolved', '-R', wc2)
      st(wc2)
      ci(wc2)
      up(wc2)
      up(wc)

      svn('merge', '--accept=postpone', '--reintegrate', branch_url, trunk)
      st(wc)
      svn('resolve', '--accept=mine-conflict', wc)
      st(wc)
      svn('resolved', '-R', wc)
      st(wc)
      ci(wc)
      up(wc2)
      up(wc)

      svn('delete', j(wc, 'branch'))
      ci(wc)
      up(wc2)
      up(wc)


    finally:
      stopped = datetime.now()
      print '\nDone with svn benchmark in', (stopped - started)
      timings.submit_timing('TOTAL RUN', timedelta_to_seconds(stopped - started))

      # rename ps to prop mod
      if timings.timings.get('ps'):
        has = timings.timings.get('prop mod')
        if not has:
          has = []
          timings.timings['prop mod'] = has
        has.extend( timings.timings['ps'] )
        del timings.timings['ps']

      print timings.summary()
  finally:
    shutil.rmtree(base)


def read_from_file(file_path):
  f = open(file_path, 'rb')
  try:
    instance = cPickle.load(f)
  finally:
    f.close()
  return instance


def write_to_file(file_path, instance):
  f = open(file_path, 'wb')
  cPickle.dump(instance, f)
  f.close()

def usage():
  print __doc__

if __name__ == '__main__':
  if len(sys.argv) > 1 and 'compare'.startswith(sys.argv[1]):
    if len(sys.argv) < 4:
      usage()
      exit(1)
    
    p1,p2 = sys.argv[2:4]

    t1 = read_from_file(p1)
    t2 = read_from_file(p2)

    print p1
    print t1.summary()
    print '---'
    print p2
    print t2.summary()
    print '---'
    print p2, '/', p1
    print t2.compare_to(t1)

  elif len(sys.argv) > 1 and 'combine'.startswith(sys.argv[1]):
    if len(sys.argv) < 5:
      usage()
      exit(1)
    
    p1,p2,dest = sys.argv[2:5]

    t1 = read_from_file(p1)
    t2 = read_from_file(p2)

    t1.add(t2)
    print t1.summary()

    write_to_file(dest, t1)



  elif len(sys.argv) > 1 and 'run'.startswith(sys.argv[1]):
    try:
      timings_path = sys.argv[2]
      levels = int(sys.argv[3])
      spread = int(sys.argv[4])

      if len(sys.argv) > 5:
        N = int(sys.argv[5])
      else:
        N = 1
    except:
      usage()
      raise

      
    print '\nHi, going to run a Subversion benchmark (series)...'

    if os.path.isfile(timings_path):
      print 'Going to add results to existing file', timings_path
      timings = read_from_file(timings_path)
    else:
      print 'Going to write results to new file', timings_path
      timings = Timings()

    for i in range(N):
      run(levels, spread)

    write_to_file(timings_path, timings)

  elif len(sys.argv) > 1 and 'show'.startswith(sys.argv[1]):
    if len(sys.argv) < 3:
      usage()
      exit(1)
      
    for timings_path in sys.argv[2:]:
      timings = read_from_file(timings_path)
      print '---\n%s' % timings_path
      print timings.summary()

  else: usage()

Attachment: run
Description: application/shellscript
