On Wed, Dec 22, 2010 at 11:50 AM, Philip Martin
<philip.mar...@wandisco.com> wrote:
> Johan Corveleyn <jcor...@gmail.com> writes:
>
>> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
>> <philip.mar...@wandisco.com> wrote:
>>> Johan Corveleyn <jcor...@gmail.com> writes:
>>>
>>>> This makes the diff algorithm another 10% - 15%
>>>> faster (granted, this was measured with my "extreme" testcase of a 1,5
>>>> Mb file (60000 lines), of which most lines are identical
>>>> prefix/suffix).
>>>
>>> Can you provide a test script? Or decribe the test more fully, please.
>>
>> Hmm, it's not easy to come up with a test script to test this "from
>> scratch" (unless with testing diff directly, see below). I test it
>> with a repository (a dump/load of an old version of our production
>> repository) which contains this 60000 line xml file (1,5 Mb) with 2272
>> revisions.
>>
>> I run blame on this file, over svnserve protocol on localhost (server
>> running on same machine), with an svnserve built from Stefan^2's
>> performance branch (with membuffer caching of full-texts, so server
>> I/O is not the bottleneck). This gives me an easy way to call 2272
>> times diff on this file, and measure it (with the help of some
>> instrumentation code in blame.c, see attachment). And it's
>> incidentally the actual use case I first started out wanting to
>> optimize (blame for large files with many revisions).
>
> Testing with real-world data is important, perhaps even more important
> than artificial test data, but some test data would be useful. If you
> were to write a script to generate two test files of size 100MB, say,
> then you could use the tools/diff/diff utility to run Subversion diff on
> those two files. Or tools/diff/diff3 if it's a 3-way diff that matters.
> The first run might involve disk IO, but on most machines the OS should
> be able to cache the files and subsequent hot-cache runs should be a
> good way to profile the diff code, assumming it is CPU limited.
Yes, that's a good idea. I'll try to spend some time on that. But I'm
wondering about a good way to write such a script. I'd like it to
generate large files quickly, with content that's not totally random,
but also not 1000000 times the exact same line (neither of those would
be representative of real-world data, and either might hit some
edge-case behavior of the diff algorithm). (Maybe totally random is
fine, but is there an easy/fast way to generate that?)

As a first attempt, I quickly hacked up a small shell script that
writes out lines one by one in a for loop, each a fixed string plus the
line number (the index of the iteration). But that's too slow: 10000
lines of 70 bytes, i.e. 700 KB, already takes 14 seconds.

Maybe I can start with 10 or 20 different lines (or generate 100 in a
for loop), and then keep doubling that until I have enough
(cat file.txt >> file.txt). That will probably be faster, and it might
be "real-worldish" enough (a single source file also contains many
identical lines, e.g. all the lines with just a single brace).

Other ideas? Maybe there is already something like this lying around?

Another question: a shell script is probably not ideal, since it's
neither portable nor fast. Should I use Python for this? Maybe the
"write line by line with a line number in a for loop" approach would be
a lot faster in Python? I don't know much Python, but it might be a
good opportunity to learn some ...

Are there any examples of such "manual test scripts" in svn, so I could
have a look at the style and coding habits, and maybe borrow some
boilerplate code?

Cheers,
--
Johan
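
P.S.: Here's a rough, untested Python sketch of the "seed with a
handful of distinct lines, then keep doubling" idea from above (the
file name, line contents and target size are just placeholders):

[[[
#!/usr/bin/env python
# Untested sketch: build ~100 distinct seed lines (a fixed string plus
# a line number), then keep doubling the data in memory until it
# reaches the target size, and write it out in one go.

TARGET_SIZE = 100 * 1024 * 1024   # ~100 MB, as suggested above

seed_lines = ['this is line %d of some generated test data\n' % i
              for i in range(100)]
data = ''.join(seed_lines)

# the in-memory equivalent of "cat file.txt >> file.txt"
while len(data) < TARGET_SIZE:
    data = data + data

out = open('generated-test-file.txt', 'w')
out.write(data)
out.close()
]]]

To get two slightly different files for diffing, the same script could
tweak a few of the seed lines (or insert/remove some lines) before
writing out the second file.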