On Thu, Dec 23, 2010 at 11:43 PM, Gavin Beau Baumanis
<gav...@thespidernet.com> wrote:
> On 23/12/2010, at 11:36 PM, Johan Corveleyn wrote:
>
>> On Thu, Dec 23, 2010 at 3:44 AM, Gavin Beau Baumanis
>> <gav...@thespidernet.com> wrote:
>>> Hi Johan,
>>
>> Hi Gavin. Thanks for your interest. It's a nice problem, isn't it?
>
> Yes - it "should be" so very simple... but a little thought - and before
> you know it - it's not!
>
>>> I was intrigued by your requirement to create a large file for testing.
>>> I remember from a really long time ago, when I learnt C, that we used a
>>> specific algorithm for creating "natural" and "random" text.
>>> With some help from Mr. Google I found out about Markov chains, which
>>> look promising - I can't remember if that was what I learned about or
>>> not - but it looks like it might prove helpful nonetheless.
>>>
>>> A little further Googling and I found this specific post on Stack Overflow:
>>> http://stackoverflow.com/questions/1037719/how-can-i-quickly-create-large-1gb-textbinary-files-with-natural-content
>>
>> Interesting link. If I'm reading it correctly, the best suggestion in
>> there was with those Markov chains. But it seems maybe a little
>> overkill / too much work for me.
>
> I might have an answer for you there.
> I pointed out this thread to our intern.
> He seems to know about Markov chains already - I have asked him to
> contribute to the thread.
>
>>> No idea if it is going to help you specifically or not... but there are
>>> quite a few ideas in the comments;
>>> * Obtain a copy of the first 100MB from Wikipedia - for example.
>>
>> Nah, I wouldn't like to depend on some internet resource.
>>
>>> Finally, if you happen to have a large enough file already, could you
>>> not use the "split" (unix) function to give you a specific sized file?
>>
>> Yes, but we'll first need that large enough file :-).
>>
>>> Actually - I was so intrigued by the "challenge" of this - I have had a
>>> think about it over lunch;
>>>
>>> Could you not do this?
>>> (pseudo-code)
>>>
>>> Start with a known "chunk" - say the license file.
>>> Get the length of the license file - for example only = 125 bytes.
>>> Append the chunk to itself until such time as you have the required size.
>>> Write the file once.
>>>
>>> startSize = length(knownFile);
>>> int numberOfLoops = log2(desiredFileSize / startSize);
>>> newFile = knownFile
>>>
>>> Loop for numberOfLoops
>>> {
>>>     newFile = newFile + newFile
>>> }
>>>
>>> write newFile;
>>
>> Yes, that certainly seems a sensible approach. I actually had the same
>> idea (see below, "doubling the file until I have enough (cat file.txt
>> >> file.txt)"). Well, it's almost the same: in your suggestion the file
>> is built up completely in memory and only written out at the end.
>> That's also an option, but I think it's better to write it out every
>> time it's doubled (to avoid using up too much memory).
>
> I certainly understand that, but in today's GB-RAM PCs / laptops,
> is a 100 MB file in memory such a worrisome issue?
> It would only be short-lived - until the file was written out. The memory
> could then be cleared?
> And from your first email, it seemed like timing was the hurdle to clear,
> and the file open / write / close on every iteration of the loop could
> certainly be the bottleneck of your first attempt.
>
> Anyway - please be sure not to let me "bully" you into anything you don't
> think is right for your requirements :)
> Andy (our intern) might be able to provide the missing link for you.
>
> Nonetheless, good luck.
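(As an aside, to make the doubling pseudo-code quoted above concrete: an
untested shell sketch could look like the following. The seed file, target
name and target size are made-up examples, and it writes the file back out
to disk after every doubling instead of building it all up in memory.)

#!/bin/bash
# Doubling sketch: start from a small seed file and keep appending the file
# to itself until it reaches the target size. Takes about
# log2(TARGET_SIZE / seed size) passes, as in the pseudo-code above.

SEED=LICENSE                            # any small "known chunk"
TARGET=big_testfile.txt
TARGET_SIZE=$(( 100 * 1024 * 1024 ))    # 100 MB

cp "$SEED" "$TARGET"

# "cat file >> file" would read the file while appending to it, so append
# from a temporary copy instead.
while (( $(wc -c < "$TARGET") < TARGET_SIZE )); do
    cp "$TARGET" "$TARGET.tmp"
    cat "$TARGET.tmp" >> "$TARGET"
done
rm -f "$TARGET.tmp"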
Thanks, but maybe it will not be needed. I've found a way that is good
enough for me at the moment (although it will not hurt to have a better
algorithm etc., so feel free to go ahead anyway).

I've thought/tried a bit harder, and improved on my first naive attempt:

>>>> As a first attempt, I quickly hacked up a small shell script, writing
>>>> out lines in a for loop, one by one, with a fixed string together with
>>>> the line number (index of the iteration). But that's too slow (10000
>>>> lines of 70 bytes, i.e. 700Kb, is already taking 14 seconds).

That wasn't very smart :-). Writing line by line into a file, tssss.

If I switch to building the lines up in a shell variable, and then write
that out to the file in one go, the 10000 lines are done in 1 second:

#!/bin/bash
CHUNK_SIZE=10000    # 10000 lines
LINE="This is a plain line of text in the file. It's actually line number"
CHUNK=`for (( i=0; i < $CHUNK_SIZE; i++ )); do echo "$LINE $i."; done`
echo "$CHUNK" > $FILENAME

Appending such chunks in a for loop a couple of times quickly generates a
file that's large enough, with some repetition of lines, and not too much
memory usage.

Attached is a version of the script "generate_big_testfiles.sh". It writes
10 of such 10000-line chunks to the "original file", and the same to the
"modified file", with 100 changed lines inserted in the middle. That gives
me two files of ~7 MB, in around 7 seconds on my machine.

I will continue to fiddle a little bit with this script; suggestions
welcome :-).

I'm currently having some trouble building the tools directory, but once I
get that fixed, I'll test the tools/diff executables with these big test
files.

Cheers,
-- 
Johan
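P.S.: for anyone reading along without the attachment, the script does
roughly the following (a sketch based on the description above; the output
file names and the wording of the inserted "changed" lines are
placeholders, and the attached version may differ in details):

#!/bin/bash
# Rough sketch of generate_big_testfiles.sh as described in this mail.

ORIGINAL=big_original.txt
MODIFIED=big_modified.txt

CHUNK_SIZE=10000    # lines per chunk
NUM_CHUNKS=10       # 10 chunks of 10000 lines each -> ~7 MB per file
LINE="This is a plain line of text in the file. It's actually line number"

# Build one 10000-line chunk in a shell variable, then write it out in one go.
CHUNK=$(for (( i = 0; i < CHUNK_SIZE; i++ )); do echo "$LINE $i."; done)

rm -f "$ORIGINAL" "$MODIFIED"

for (( c = 0; c < NUM_CHUNKS; c++ )); do
    echo "$CHUNK" >> "$ORIGINAL"
    echo "$CHUNK" >> "$MODIFIED"

    # Halfway through, insert 100 changed lines into the modified file only.
    if (( c == NUM_CHUNKS / 2 )); then
        for (( i = 0; i < 100; i++ )); do
            echo "This is a changed line inserted in the middle. Number $i."
        done >> "$MODIFIED"
    fi
done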
[Attachment: generate_big_testfiles.sh - Bourne shell script]