Johan Corveleyn <jcor...@gmail.com> writes:

> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
> <philip.mar...@wandisco.com> wrote:
>> Johan Corveleyn <jcor...@gmail.com> writes:
>>
>>> This makes the diff algorithm another 10% - 15% faster (granted,
>>> this was measured with my "extreme" testcase of a 1,5 Mb file
>>> (60000 lines), of which most lines are identical prefix/suffix).
>>
>> Can you provide a test script? Or describe the test more fully, please.
>
> Hmm, it's not easy to come up with a test script to test this "from
> scratch" (unless by testing diff directly, see below). I test it with
> a repository (a dump/load of an old version of our production
> repository) which contains this 60000-line xml file (1,5 Mb) with
> 2272 revisions.
>
> I run blame on this file over the svnserve protocol on localhost
> (server running on the same machine), with an svnserve built from
> Stefan^2's performance branch (with membuffer caching of full-texts,
> so server I/O is not the bottleneck). This gives me an easy way to
> call diff 2272 times on this file and measure it (with the help of
> some instrumentation code in blame.c, see attachment). And it's
> incidentally the actual use case I first set out to optimize (blame
> for large files with many revisions).
Testing with real-world data is important, perhaps even more important than artificial test data, but some generated test data would still be useful. If you were to write a script that generates two test files of, say, 100MB each, then you could use the tools/diff/diff utility to run the Subversion diff on those two files, or tools/diff/diff3 if it's a 3-way diff that matters. The first run might involve disk I/O, but on most machines the OS should be able to cache the files, and subsequent hot-cache runs should be a good way to profile the diff code, assuming it is CPU-limited.

--
Philip
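For what it's worth, a generator script along the lines suggested above could look like the following sketch: two large line-based files that are identical except for a small region in the middle, so that most lines are shared prefix/suffix (matching Johan's real-world case). The file names, line format, and sizes are illustrative assumptions, not anything from the earlier mails.

```python
def generate_pair(left_path, right_path, n_lines, changed):
    """Write two line-based files that differ only on the 0-based
    line indices listed in the set `changed`; all other lines are
    byte-identical, giving a long common prefix and suffix."""
    with open(left_path, "w") as left, open(right_path, "w") as right:
        for i in range(n_lines):
            line = "line %010d some padding text here\n" % i
            left.write(line)
            if i in changed:
                right.write("CHANGED %010d some padding text\n" % i)
            else:
                right.write(line)

# For roughly 100 MB per file (~32 bytes/line), something like:
#   n = 3200000
#   generate_pair("left.txt", "right.txt", n,
#                 set(range(n // 2, n // 2 + 100)))
```

The resulting pair could then be fed repeatedly to the in-tree utility Philip mentions, e.g. `tools/diff/diff left.txt right.txt`, timing only the hot-cache runs.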