Johan Corveleyn <jcor...@gmail.com> writes:

> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
> <philip.mar...@wandisco.com> wrote:
>> Johan Corveleyn <jcor...@gmail.com> writes:
>>
>>> This makes the diff algorithm another 10% - 15% faster (granted,
>>> this was measured with my "extreme" testcase of a 1,5 Mb file
>>> (60000 lines), of which most lines are identical prefix/suffix).
>>
>> Can you provide a test script? Or describe the test more fully, please.
>
> Hmm, it's not easy to come up with a test script to test this "from
> scratch" (unless by testing diff directly, see below). I test it with
> a repository (a dump/load of an old version of our production
> repository) which contains this 60000-line xml file (1,5 Mb) with
> 2272 revisions.
>
> I run blame on this file over the svnserve protocol on localhost
> (server running on the same machine), with an svnserve built from
> Stefan^2's performance branch (with membuffer caching of full-texts,
> so server I/O is not the bottleneck). This gives me an easy way to
> call diff 2272 times on this file and measure it (with the help of
> some instrumentation code in blame.c, see attachment). And it's
> incidentally the actual use case I first set out to optimize (blame
> for large files with many revisions).
Testing with real-world data is important, perhaps even more important than artificial test data, but some generated test data would still be useful. If you were to write a script that generates two test files of, say, 100MB each, then you could use the tools/diff/diff utility to run the Subversion diff on those two files, or tools/diff/diff3 if it's a 3-way diff that matters. The first run might involve disk I/O, but on most machines the OS should be able to cache the files, and subsequent hot-cache runs should be a good way to profile the diff code, assuming it is CPU-limited.

--
Philip
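For what it's worth, a generator script along the lines suggested above could look like the following sketch: two large line-based files that are identical except for a small region in the middle, so that most lines are shared prefix/suffix (matching Johan's real-world case). The file names, line format, and sizes are illustrative assumptions, not anything from the earlier mails.

```python
def generate_pair(left_path, right_path, n_lines, changed):
    """Write two line-based files that differ only on the 0-based
    line indices listed in the set `changed`; all other lines are
    byte-identical, giving a long common prefix and suffix."""
    with open(left_path, "w") as left, open(right_path, "w") as right:
        for i in range(n_lines):
            line = "line %010d some padding text here\n" % i
            left.write(line)
            if i in changed:
                right.write("CHANGED %010d some padding text\n" % i)
            else:
                right.write(line)

# For roughly 100 MB per file (~32 bytes/line), something like:
#   n = 3200000
#   generate_pair("left.txt", "right.txt", n,
#                 set(range(n // 2, n // 2 + 100)))
```

The resulting pair could then be fed repeatedly to the in-tree utility Philip mentions, e.g. `tools/diff/diff left.txt right.txt`, timing only the hot-cache runs.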