On Thu, Dec 23, 2010 at 11:43 PM, Gavin Beau Baumanis
<gav...@thespidernet.com> wrote:
> On 23/12/2010, at 11:36 PM, Johan Corveleyn wrote:
>
>> On Thu, Dec 23, 2010 at 3:44 AM, Gavin Beau Baumanis
>> <gav...@thespidernet.com> wrote:
>>> Hi Johan,
>>
>> Hi Gavin. Thanks for your interest. It's a nice problem, isn't it?
>
> Yes - it "should be" so very simple... but a little thought - and before
> you know it - it's not!
>
>>> I was intrigued by your requirement to create a large file for testing.
>>> I remember from a really long time ago, when I learnt C, that we used a
>>> specific algorithm for creating "natural" and "random" text.
>>> With some help from Mr. Google I found out about Markov chains, which
>>> look promising - I can't remember if that was what I learned about or
>>> not - but it looks like it might prove helpful nonetheless.
>>>
>>> A little further Googling and I found this specific post on Stack Overflow:
>>> http://stackoverflow.com/questions/1037719/how-can-i-quickly-create-large-1gb-textbinary-files-with-natural-content
>>
>> Interesting link. If I'm reading it correctly, the best suggestion in
>> there was with those Markov chains. But it seems maybe a little
>> overkill / too much work for me.
>
> I might have an answer for you there.
> I pointed out this thread to our intern.
> He seems to know about Markov chains already - I have asked him to
> contribute to the thread.
>
>>> No idea if it is going to help you specifically or not... but there are
>>> quite a few ideas in the comments;
>>> * Obtain a copy of the first 100MB from Wikipedia - for example.
>>
>> Nah, I wouldn't like to depend on some internet resource.
>>
>>> Finally, if you happen to have a large enough file already, could you
>>> not use the "split" (unix) function to give you a specific sized file?
>>
>> Yes, but we'll first need that large enough file :-).
>>
>>> Actually - I was so intrigued by the "challenge" of this - I have had a
>>> think about it over lunch;
>>>
>>> Could you not do this?
>>> (pseudo-code)
>>>
>>> Start with a known "chunk" - say the license file.
>>> Get the length of the license file - for example only = 125 bytes.
>>> Append the chunk to itself until such time as you have the required size.
>>> Write the file once.
>>>
>>> startSize = length(knownFile);
>>> int numberOfLoops = log2(desiredFileSize / startSize);
>>> newFile = knownFile
>>>
>>> Loop for numberOfLoops
>>> {
>>>     newFile = newFile + newFile
>>> }
>>>
>>> write newFile;
>>
>> Yes, that certainly seems a sensible approach. I actually had the same
>> idea (see below, "doubling the file until I have enough (cat file.txt
>> >> file.txt)"). Well, it's almost the same: in your suggestion the file
>> is built up completely in memory and only written out at the end.
>> That's also an option, but I think it's better to write it out every
>> time it's doubled (to avoid using up too much memory).
>
> I certainly understand that, but in today's GB-RAM PCs / laptops,
> is a 100 MB file in memory such a worrisome issue?
> It would only be short-lived - until the file was written out. The memory
> could then be cleared?
> And from your first email, it seemed like timing was the hurdle to clear,
> and the file open / write / close on every iteration of the loop could
> certainly be the bottleneck of your first attempt.
>
> Anyway - please be sure not to let me "bully" you into anything you don't
> think is right for your requirements :)
> Andy (our intern) might be able to provide the missing link for you.
>
> Nonetheless, good luck.
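(As an aside, to make the doubling pseudo-code quoted above concrete: an
untested shell sketch could look like the following. The seed file, target
name and target size are made-up examples, and it writes the file back out
to disk after every doubling instead of building it all up in memory.)

#!/bin/bash
# Doubling sketch: start from a small seed file and keep appending the file
# to itself until it reaches the target size. Takes about
# log2(TARGET_SIZE / seed size) passes, as in the pseudo-code above.

SEED=LICENSE                            # any small "known chunk"
TARGET=big_testfile.txt
TARGET_SIZE=$(( 100 * 1024 * 1024 ))    # 100 MB

cp "$SEED" "$TARGET"

# "cat file >> file" would read the file while appending to it, so append
# from a temporary copy instead.
while (( $(wc -c < "$TARGET") < TARGET_SIZE )); do
    cp "$TARGET" "$TARGET.tmp"
    cat "$TARGET.tmp" >> "$TARGET"
done
rm -f "$TARGET.tmp"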
Thanks, but maybe it will not be needed. I've found a way that is good
enough for me at the moment (although it will not hurt to have a better
algorithm etc., so feel free to go ahead anyway).

I've thought/tried a bit harder, and improved on my first naive attempt:

>>>> As a first attempt, I quickly hacked up a small shell script, writing
>>>> out lines in a for loop, one by one, with a fixed string together with
>>>> the line number (index of the iteration). But that's too slow (10000
>>>> lines of 70 bytes, i.e. 700Kb, is already taking 14 seconds).

That wasn't very smart :-). Writing line by line into a file, tssss.

If I switch to building the lines up in a shell variable, and then write
that out to the file in one go, the 10000 lines are done in 1 second:

#!/bin/bash
CHUNK_SIZE=10000    # 10000 lines
LINE="This is a plain line of text in the file. It's actually line number"
CHUNK=`for (( i=0; i < $CHUNK_SIZE; i++ )); do echo "$LINE $i."; done`
echo "$CHUNK" > $FILENAME

Appending such chunks in a for loop a couple of times quickly generates a
file that's large enough, with some repetition of lines, and not too much
memory usage.

Attached is a version of the script "generate_big_testfiles.sh". It writes
10 of such 10000-line chunks to the "original file", and the same to the
"modified file", with 100 changed lines inserted in the middle. That gives
me two files of ~7 MB, in around 7 seconds on my machine.

I will continue to fiddle a little bit with this script; suggestions
welcome :-).

I'm currently having some trouble building the tools directory, but once I
get that fixed, I'll test the tools/diff executables with these big test
files.

Cheers,
-- 
Johan
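P.S.: for anyone reading along without the attachment, the script does
roughly the following (a sketch based on the description above; the output
file names and the wording of the inserted "changed" lines are
placeholders, and the attached version may differ in details):

#!/bin/bash
# Rough sketch of generate_big_testfiles.sh as described in this mail.

ORIGINAL=big_original.txt
MODIFIED=big_modified.txt

CHUNK_SIZE=10000    # lines per chunk
NUM_CHUNKS=10       # 10 chunks of 10000 lines each -> ~7 MB per file
LINE="This is a plain line of text in the file. It's actually line number"

# Build one 10000-line chunk in a shell variable, then write it out in one go.
CHUNK=$(for (( i = 0; i < CHUNK_SIZE; i++ )); do echo "$LINE $i."; done)

rm -f "$ORIGINAL" "$MODIFIED"

for (( c = 0; c < NUM_CHUNKS; c++ )); do
    echo "$CHUNK" >> "$ORIGINAL"
    echo "$CHUNK" >> "$MODIFIED"

    # Halfway through, insert 100 changed lines into the modified file only.
    if (( c == NUM_CHUNKS / 2 )); then
        for (( i = 0; i < 100; i++ )); do
            echo "This is a changed line inserted in the middle. Number $i."
        done >> "$MODIFIED"
    fi
done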
[Attachment: generate_big_testfiles.sh - Bourne shell script]