On Wed, Dec 22, 2010 at 11:50 AM, Philip Martin
<philip.mar...@wandisco.com> wrote:
> Johan Corveleyn <jcor...@gmail.com> writes:
>
>> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
>> <philip.mar...@wandisco.com> wrote:
>>> Johan Corveleyn <jcor...@gmail.com> writes:
>>>
>>>> This makes the diff algorithm another 10% - 15%
>>>> faster (granted, this was measured with my "extreme" testcase of a 1,5
>>>> Mb file (60000 lines), of which most lines are identical
>>>> prefix/suffix).
>>>
>>> Can you provide a test script? Or decribe the test more fully, please.
>>
>> Hmm, it's not easy to come up with a test script to test this "from
>> scratch" (unless with testing diff directly, see below). I test it
>> with a repository (a dump/load of an old version of our production
>> repository) which contains this 60000 line xml file (1,5 Mb) with 2272
>> revisions.
>>
>> I run blame on this file, over svnserve protocol on localhost (server
>> running on same machine), with an svnserve built from Stefan^2's
>> performance branch (with membuffer caching of full-texts, so server
>> I/O is not the bottleneck). This gives me an easy way to call 2272
>> times diff on this file, and measure it (with the help of some
>> instrumentation code in blame.c, see attachment). And it's
>> incidentally the actual use case I first started out wanting to
>> optimize (blame for large files with many revisions).
>
> Testing with real-world data is important, perhaps even more important
> than artificial test data, but some test data would be useful. If you
> were to write a script to generate two test files of size 100MB, say,
> then you could use the tools/diff/diff utility to run Subversion diff on
> those two files. Or tools/diff/diff3 if it's a 3-way diff that matters.
> The first run might involve disk IO, but on most machines the OS should
> be able to cache the files and subsequent hot-cache runs should be a
> good way to profile the diff code, assumming it is CPU limited.
Yes, that's a good idea. I'll try to spend some time on that. But I'm
wondering about a good way to write such a script. I'd like it to
generate large files quickly, with content that's not totally random,
but also not 1000000 times the exact same line (neither of those would
be representative of real-world data, and either might hit some
edge-case behavior of the diff algorithm). (Maybe totally random is
fine, but is there an easy/fast way to generate that?)

As a first attempt, I quickly hacked up a small shell script that
writes out lines one by one in a for loop, each a fixed string plus the
line number (the index of the iteration). But that's too slow: 10000
lines of 70 bytes, i.e. 700 KB, already takes 14 seconds.

Maybe I can start with 10 or 20 different lines (or generate 100 in a
for loop), and then keep doubling that until I have enough
(cat file.txt >> file.txt). That will probably be faster, and it might
be "real-worldish" enough (a single source file also contains many
identical lines, e.g. all the lines with just a single brace).

Other ideas? Maybe there is already something like this lying around?

Another question: a shell script is probably not ideal, since it's
neither portable nor fast. Should I use Python for this? Maybe the
"write line by line with a line number in a for loop" approach would be
a lot faster in Python? I don't know much Python, but it might be a
good opportunity to learn some ...

Are there any examples of such "manual test scripts" in svn, so I could
have a look at the style and coding habits, and maybe borrow some
boilerplate code?

Cheers,
--
Johan
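
P.S.: Here's a rough, untested Python sketch of the "seed with a
handful of distinct lines, then keep doubling" idea from above (the
file name, line contents and target size are just placeholders):

[[[
#!/usr/bin/env python
# Untested sketch: build ~100 distinct seed lines (a fixed string plus
# a line number), then keep doubling the data in memory until it
# reaches the target size, and write it out in one go.

TARGET_SIZE = 100 * 1024 * 1024   # ~100 MB, as suggested above

seed_lines = ['this is line %d of some generated test data\n' % i
              for i in range(100)]
data = ''.join(seed_lines)

# the in-memory equivalent of "cat file.txt >> file.txt"
while len(data) < TARGET_SIZE:
    data = data + data

out = open('generated-test-file.txt', 'w')
out.write(data)
out.close()
]]]

To get two slightly different files for diffing, the same script could
tweak a few of the seed lines (or insert/remove some lines) before
writing out the second file.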