On 23/12/2010, at 11:36 PM, Johan Corveleyn wrote:

> On Thu, Dec 23, 2010 at 3:44 AM, Gavin Beau Baumanis
> <gav...@thespidernet.com> wrote:
>> Hi Johan,
>
> Hi Gavin. Thanks for your interest. It's a nice problem isn't it?
Yes - it "should be" so very simple... but a little thought - and before
you know it - it's not!

>
>> I was intrigued by your requirement to create a large file for testing.
>> I remember from a really long time ago when I learnt C, that we used a
>> specific algorithm for creating "natural" and "random" text.
>> With some help from Mr. Google I found out about Markov Chains, which look
>> promising - I can't remember if that was what I learned about or not - but
>> it looks like it might prove helpful nonetheless.
>>
>> A little further Googling and I found this specific post on stackoverflow.
>> http://stackoverflow.com/questions/1037719/how-can-i-quickly-create-large-1gb-textbinary-files-with-natural-content
>
> Interesting link. If I'm reading it correctly, the best suggestion in
> there was with those Markov Chains. But it seems maybe a little
> overkill / too much work for me.

I might have an answer for you there. I pointed out this thread to our
intern. He already seems to know about Markov chains - I have asked him to
contribute to the thread.

>
>> No idea if it is going to help you specifically or not... but there are
>> quite a few ideas in the comments;
>> * Obtain a copy of the first 100MB from wikipedia - for example.
>
> Nah, I wouldn't like to depend on some internet resource.
>
>> Finally, if you happen to have a large enough file already, could you not
>> use the "split" (unix) utility to give you a specific sized file?
>
> Yes, but we'll first need that large enough file :-).
>
>>
>> Actually - I was so intrigued by the "challenge" of this - I have had a
>> think about it over lunch;
>>
>> Could you not do this?
>> (pseudo-code)
>>
>> Start with a known "chunk" - say the license file.
>> Get the length of the license file - for example only = 125 bytes.
>> Append the chunk to itself until such time as you have the required size.
>> Write the file once.
>>
>> startSize = length(knownFile);
>> int numberOfLoops = ceil(log2(desiredFileSize / startSize));
>> newFile = knownFile
>>
>> Loop for numberOfLoops
>> {
>>     newFile = newFile + newFile
>> }
>>
>> write newFile;
>
> Yes, that certainly seems a sensible approach. I actually had the same
> idea (see below, "doubling the file until I have enough (cat file.txt
> >> file.txt)"). Well, it's almost the same: in your suggestion the file
> is built up completely in memory and only written out in the end.
> That's also an option, but I think it's better to write it out every
> time it's doubled (to avoid using up too much memory).

I certainly understand that, but with today's GB-RAM PCs / laptops, is a
100 MB file in memory really such a worrisome issue? It would only be
short-lived - until the file was written out - and the memory could then
be freed. And from your first email it seemed like timing was the hurdle
to clear, and the file open / write / close on every iteration of the
loop could certainly have been the bottleneck of your first attempt.

Anyway - please be sure not to let me "bully" you into anything you
don't think is right for your requirements :)
Andy (our intern) might be able to provide the missing link for you.
Nonetheless, good luck.

<snip>

>
> But good suggestion anyway. I think I will go with some variation of
> this (starting with a large enough chunk of representative content).
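For what it's worth, that variation is only a few lines of Python if you
decide to go that way. This is just a sketch (the file names and target
size are made-up examples), writing the file back out on every doubling
as you prefer:

import os

def make_big_file(seed_path, out_path, target_bytes):
    # Seed the output file with a known chunk of representative
    # content (e.g. a LICENSE file or a few hundred source lines).
    with open(seed_path, 'rb') as src:
        seed = src.read()
    with open(out_path, 'wb') as dst:
        dst.write(seed)

    # Double the file on disk until it reaches the target size.
    # At most one copy of the current contents is held in memory,
    # and only briefly, between the read and the append.
    while os.path.getsize(out_path) < target_bytes:
        with open(out_path, 'rb') as f:
            data = f.read()
        with open(out_path, 'ab') as f:
            f.write(data)

# Example (hypothetical paths and size):
# make_big_file('LICENSE', 'big-test-file.txt', 100 * 1024 * 1024)

Growing a 100 MB file from a ~125 byte seed only takes about 20
doublings, so the open/read/append cost per iteration should be
negligible compared to a per-line loop.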
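And since Markov chains came up: a minimal word-level generator really is
just a handful of lines of Python. Purely a sketch (the chain order and
how you feed and seed it are up to you), in case it saves Andy or anyone
else some work:

import random
from collections import defaultdict

def build_chain(text, order=2):
    # Map each run of `order` consecutive words to the list of words
    # that follow it somewhere in the source text.
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, n_words, order=2):
    # Walk the chain from a random starting state, picking a random
    # successor at each step; restart if we hit a dead end.
    state = random.choice(list(chain.keys()))
    out = list(state)
    while len(out) < n_words:
        successors = chain.get(state)
        if not successors:
            state = random.choice(list(chain.keys()))
            successors = chain[state]
        out.append(random.choice(successors))
        state = tuple(out[-order:])
    return ' '.join(out)

# Feed it any real text (a README, some source files, ...) and keep
# writing generated blocks to a file until it is big enough.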
>
> Cheers,
> Johan
>
>> On 23/12/2010, at 11:51 AM, Johan Corveleyn wrote:
>>
>>> On Wed, Dec 22, 2010 at 11:50 AM, Philip Martin
>>> <philip.mar...@wandisco.com> wrote:
>>>> Johan Corveleyn <jcor...@gmail.com> writes:
>>>>
>>>>> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
>>>>> <philip.mar...@wandisco.com> wrote:
>>>>>> Johan Corveleyn <jcor...@gmail.com> writes:
>>>>>>
>>>>>>> This makes the diff algorithm another 10% - 15%
>>>>>>> faster (granted, this was measured with my "extreme" testcase of a 1,5
>>>>>>> Mb file (60000 lines), of which most lines are identical
>>>>>>> prefix/suffix).
>>>>>>
>>>>>> Can you provide a test script? Or describe the test more fully, please.
>>>>>
>>>>> Hmm, it's not easy to come up with a test script to test this "from
>>>>> scratch" (unless by testing diff directly, see below). I test it
>>>>> with a repository (a dump/load of an old version of our production
>>>>> repository) which contains this 60000 line xml file (1,5 Mb) with 2272
>>>>> revisions.
>>>>>
>>>>> I run blame on this file, over svnserve protocol on localhost (server
>>>>> running on same machine), with an svnserve built from Stefan^2's
>>>>> performance branch (with membuffer caching of full-texts, so server
>>>>> I/O is not the bottleneck). This gives me an easy way to call diff
>>>>> 2272 times on this file, and measure it (with the help of some
>>>>> instrumentation code in blame.c, see attachment). And it's
>>>>> incidentally the actual use case I first started out wanting to
>>>>> optimize (blame for large files with many revisions).
>>>>
>>>> Testing with real-world data is important, perhaps even more important
>>>> than artificial test data, but some test data would be useful. If you
>>>> were to write a script to generate two test files of size 100MB, say,
>>>> then you could use the tools/diff/diff utility to run Subversion diff on
>>>> those two files. Or tools/diff/diff3 if it's a 3-way diff that matters.
>>>> The first run might involve disk IO, but on most machines the OS should
>>>> be able to cache the files and subsequent hot-cache runs should be a
>>>> good way to profile the diff code, assuming it is CPU limited.
>>>
>>> Yes, that's a good idea. I'll try to spend some time on that. But I'm
>>> wondering about a good way to write such a script.
>>>
>>> I'd like the script to generate large files quickly, and with content
>>> that's not totally random, but also not 1000000 times the exact same
>>> line (neither of those is going to be representative of real-world
>>> data, and might hit some edge behavior of the diff algorithm).
>>> (Maybe totally random is fine, but is there an easy/fast way to
>>> generate this?)
>>>
>>> As a first attempt, I quickly hacked up a small shell script, writing
>>> out lines in a for loop, one by one, with a fixed string together with
>>> the line number (the index of the iteration). But that's too slow (10000
>>> lines of 70 bytes, i.e. 700Kb, is already taking 14 seconds).
>>>
>>> Maybe I can start with 10 or 20 different lines (or generate 100 in a
>>> for loop), and then start doubling that until I have enough (cat
>>> file.txt >> file.txt). That will probably be faster. And it might be
>>> "real-worldish" enough (a single source file also contains many
>>> identical lines, e.g. all lines with a single brace etc.).
>>>
>>> Other ideas? Maybe there is already something like this lying around?
>>>
>>> Another question: a shell script might not be good, because not
>>> portable (and not fast)? Should I use python for this?
>>> Maybe the "write line by line with a line number in a for loop"
>>> approach would be a lot faster in Python? I don't know a lot of
>>> python, but it might be a good opportunity to learn some ...
>>>
>>> Are there any examples of such "manual test scripts" in svn? So I
>>> could have a look at the style, coding habits, ... maybe borrow some
>>> boilerplate code?
>>>
>>> Cheers,
>>> --
>>> Johan
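P.S. On the Python question: writing numbered lines in a plain loop is
quick in Python as long as the writes go through a normal buffered file
object - the 14 seconds for 700Kb in the shell was most likely per-line
overhead (spawning a process or re-opening the file for each line)
rather than the writing itself. A rough sketch, with made-up file name,
templates and line count:

#!/usr/bin/env python
# Sketch: write a large test file of mildly varied, numbered lines.
# The file name, templates and line count below are only examples.

TEMPLATES = [
    "    return SVN_NO_ERROR;",
    "  if (err != NULL)",
    "    {",
    "      svn_error_clear(err);",
    "    }",
    "  /* a slightly longer comment line to vary the line lengths */",
]

def write_lines(path, n_lines):
    f = open(path, 'w')
    try:
        for i in range(n_lines):
            # A repeating pattern of lines plus the line number:
            # not totally random, not 1000000 times the same line.
            f.write("%s  /* line %d */\n"
                    % (TEMPLATES[i % len(TEMPLATES)], i))
    finally:
        f.close()

if __name__ == '__main__':
    # Roughly 45 bytes per line on average, so about 2 million lines
    # gives a file on the order of 100 MB.
    write_lines('test-file.txt', 2000000)

For the two files tools/diff/diff needs, generating the second file with
a few of the templates changed (or some lines inserted) should give a
reasonably "real-worldish" diff input.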