On 23/12/2010, at 11:36 PM, Johan Corveleyn wrote:

> On Thu, Dec 23, 2010 at 3:44 AM, Gavin Beau Baumanis
> <gav...@thespidernet.com> wrote:
>> Hi Johan,
>
> Hi Gavin. Thanks for your interest. It's a nice problem isn't it?
Yes - it "should be" so very simple... but a little thought - and before
you know it - it's not!

>
>> I was intrigued by your requirement to create a large file for testing.
>> I remember from a really long time ago when I learnt C, that we used a
>> specific algorithm for creating "natural" and "random" text.
>> With some help from Mr. Google I found out about Markov Chains, which look
>> promising - I can't remember if that was what I learned about or not - but
>> it looks like it might prove helpful nonetheless.
>>
>> A little further Googling and I found this specific post on stackoverflow.
>> http://stackoverflow.com/questions/1037719/how-can-i-quickly-create-large-1gb-textbinary-files-with-natural-content
>
> Interesting link. If I'm reading it correctly, the best suggestion in
> there was with those Markov Chains. But it seems maybe a little
> overkill / too much work for me.

I might have an answer for you there. I pointed out this thread to our
intern. He already seems to know about Markov chains - I have asked him to
contribute to the thread.

>
>> No idea if it is going to help you specifically or not... but there are
>> quite a few ideas in the comments;
>> * Obtain a copy of the first 100MB from wikipedia - for example.
>
> Nah, I wouldn't like to depend on some internet resource.
>
>> Finally, if you happen to have a large enough file already, could you not
>> use the "split" (unix) utility to give you a specific sized file?
>
> Yes, but we'll first need that large enough file :-).
>
>>
>> Actually - I was so intrigued by the "challenge" of this - I have had a
>> think about it over lunch;
>>
>> Could you not do this?
>> (pseudo-code)
>>
>> Start with a known "chunk" - say the license file.
>> Get the length of the license file - for example only = 125 bytes.
>> Append the chunk to itself until such time as you have the required size.
>> Write the file once.
>>
>> startSize = length(knownFile);
>> int numberOfLoops = ceil(log2(desiredFileSize / startSize));
>> newFile = knownFile
>>
>> Loop for numberOfLoops
>> {
>>     newFile = newFile + newFile
>> }
>>
>> write newFile;
>
> Yes, that certainly seems a sensible approach. I actually had the same
> idea (see below, "doubling the file until I have enough (cat file.txt
> >> file.txt)"). Well, it's almost the same: in your suggestion the file
> is built up completely in memory and only written out in the end.
> That's also an option, but I think it's better to write it out every
> time it's doubled (to avoid using up too much memory).

I certainly understand that, but with today's GB-RAM PCs / laptops, is a
100 MB file in memory really such a worrisome issue? It would only be
short-lived - until the file was written out - and the memory could then
be freed. And from your first email it seemed like timing was the hurdle
to clear, and the file open / write / close on every iteration of the
loop could certainly have been the bottleneck of your first attempt.

Anyway - please be sure not to let me "bully" you into anything you
don't think is right for your requirements :)
Andy (our intern) might be able to provide the missing link for you.
Nonetheless, good luck.

<snip>

>
> But good suggestion anyway. I think I will go with some variation of
> this (starting with a large enough chunk of representative content).
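For what it's worth, that variation is only a few lines of Python if you
decide to go that way. This is just a sketch (the file names and target
size are made-up examples), writing the file back out on every doubling
as you prefer:

import os

def make_big_file(seed_path, out_path, target_bytes):
    # Seed the output file with a known chunk of representative
    # content (e.g. a LICENSE file or a few hundred source lines).
    with open(seed_path, 'rb') as src:
        seed = src.read()
    with open(out_path, 'wb') as dst:
        dst.write(seed)

    # Double the file on disk until it reaches the target size.
    # At most one copy of the current contents is held in memory,
    # and only briefly, between the read and the append.
    while os.path.getsize(out_path) < target_bytes:
        with open(out_path, 'rb') as f:
            data = f.read()
        with open(out_path, 'ab') as f:
            f.write(data)

# Example (hypothetical paths and size):
# make_big_file('LICENSE', 'big-test-file.txt', 100 * 1024 * 1024)

Growing a 100 MB file from a ~125 byte seed only takes about 20
doublings, so the open/read/append cost per iteration should be
negligible compared to a per-line loop.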
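And since Markov chains came up: a minimal word-level generator really is
just a handful of lines of Python. Purely a sketch (the chain order and
how you feed and seed it are up to you), in case it saves Andy or anyone
else some work:

import random
from collections import defaultdict

def build_chain(text, order=2):
    # Map each run of `order` consecutive words to the list of words
    # that follow it somewhere in the source text.
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, n_words, order=2):
    # Walk the chain from a random starting state, picking a random
    # successor at each step; restart if we hit a dead end.
    state = random.choice(list(chain.keys()))
    out = list(state)
    while len(out) < n_words:
        successors = chain.get(state)
        if not successors:
            state = random.choice(list(chain.keys()))
            successors = chain[state]
        out.append(random.choice(successors))
        state = tuple(out[-order:])
    return ' '.join(out)

# Feed it any real text (a README, some source files, ...) and keep
# writing generated blocks to a file until it is big enough.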
>
> Cheers,
> Johan
>
>> On 23/12/2010, at 11:51 AM, Johan Corveleyn wrote:
>>
>>> On Wed, Dec 22, 2010 at 11:50 AM, Philip Martin
>>> <philip.mar...@wandisco.com> wrote:
>>>> Johan Corveleyn <jcor...@gmail.com> writes:
>>>>
>>>>> On Mon, Dec 20, 2010 at 11:19 AM, Philip Martin
>>>>> <philip.mar...@wandisco.com> wrote:
>>>>>> Johan Corveleyn <jcor...@gmail.com> writes:
>>>>>>
>>>>>>> This makes the diff algorithm another 10% - 15%
>>>>>>> faster (granted, this was measured with my "extreme" testcase of a 1,5
>>>>>>> Mb file (60000 lines), of which most lines are identical
>>>>>>> prefix/suffix).
>>>>>>
>>>>>> Can you provide a test script? Or describe the test more fully, please.
>>>>>
>>>>> Hmm, it's not easy to come up with a test script to test this "from
>>>>> scratch" (unless by testing diff directly, see below). I test it
>>>>> with a repository (a dump/load of an old version of our production
>>>>> repository) which contains this 60000 line xml file (1,5 Mb) with 2272
>>>>> revisions.
>>>>>
>>>>> I run blame on this file, over svnserve protocol on localhost (server
>>>>> running on same machine), with an svnserve built from Stefan^2's
>>>>> performance branch (with membuffer caching of full-texts, so server
>>>>> I/O is not the bottleneck). This gives me an easy way to call diff
>>>>> 2272 times on this file, and measure it (with the help of some
>>>>> instrumentation code in blame.c, see attachment). And it's
>>>>> incidentally the actual use case I first started out wanting to
>>>>> optimize (blame for large files with many revisions).
>>>>
>>>> Testing with real-world data is important, perhaps even more important
>>>> than artificial test data, but some test data would be useful. If you
>>>> were to write a script to generate two test files of size 100MB, say,
>>>> then you could use the tools/diff/diff utility to run Subversion diff on
>>>> those two files. Or tools/diff/diff3 if it's a 3-way diff that matters.
>>>> The first run might involve disk IO, but on most machines the OS should
>>>> be able to cache the files and subsequent hot-cache runs should be a
>>>> good way to profile the diff code, assuming it is CPU limited.
>>>
>>> Yes, that's a good idea. I'll try to spend some time on that. But I'm
>>> wondering about a good way to write such a script.
>>>
>>> I'd like the script to generate large files quickly, and with content
>>> that's not totally random, but also not 1000000 times the exact same
>>> line (neither of those is going to be representative of real-world
>>> data, and might hit some edge behavior of the diff algorithm).
>>> (Maybe totally random is fine, but is there an easy/fast way to
>>> generate this?)
>>>
>>> As a first attempt, I quickly hacked up a small shell script, writing
>>> out lines in a for loop, one by one, with a fixed string together with
>>> the line number (the index of the iteration). But that's too slow (10000
>>> lines of 70 bytes, i.e. 700Kb, is already taking 14 seconds).
>>>
>>> Maybe I can start with 10 or 20 different lines (or generate 100 in a
>>> for loop), and then start doubling that until I have enough (cat
>>> file.txt >> file.txt). That will probably be faster. And it might be
>>> "real-worldish" enough (a single source file also contains many
>>> identical lines, e.g. all lines with a single brace etc.).
>>>
>>> Other ideas? Maybe there is already something like this lying around?
>>>
>>> Another question: a shell script might not be good, because not
>>> portable (and not fast)? Should I use python for this?
>>> Maybe the "write line by line with a line number in a for loop"
>>> approach would be a lot faster in Python? I don't know a lot of
>>> python, but it might be a good opportunity to learn some ...
>>>
>>> Are there any examples of such "manual test scripts" in svn? So I
>>> could have a look at the style, coding habits, ... maybe borrow some
>>> boilerplate code?
>>>
>>> Cheers,
>>> --
>>> Johan
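P.S. On the Python question: writing numbered lines in a plain loop is
quick in Python as long as the writes go through a normal buffered file
object - the 14 seconds for 700Kb in the shell was most likely per-line
overhead (spawning a process or re-opening the file for each line)
rather than the writing itself. A rough sketch, with made-up file name,
templates and line count:

#!/usr/bin/env python
# Sketch: write a large test file of mildly varied, numbered lines.
# The file name, templates and line count below are only examples.

TEMPLATES = [
    "    return SVN_NO_ERROR;",
    "  if (err != NULL)",
    "    {",
    "      svn_error_clear(err);",
    "    }",
    "  /* a slightly longer comment line to vary the line lengths */",
]

def write_lines(path, n_lines):
    f = open(path, 'w')
    try:
        for i in range(n_lines):
            # A repeating pattern of lines plus the line number:
            # not totally random, not 1000000 times the same line.
            f.write("%s  /* line %d */\n"
                    % (TEMPLATES[i % len(TEMPLATES)], i))
    finally:
        f.close()

if __name__ == '__main__':
    # Roughly 45 bytes per line on average, so about 2 million lines
    # gives a file on the order of 100 MB.
    write_lines('test-file.txt', 2000000)

For the two files tools/diff/diff needs, generating the second file with
a few of the templates changed (or some lines inserted) should give a
reasonably "real-worldish" diff input.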