On 26 Aug, 15:45, vasudevram <[EMAIL PROTECTED]> wrote:
> On Aug 26, 6:48 am, Paul McGuire <[EMAIL PROTECTED]> wrote:
> > On Aug 25, 8:15 pm, Paul McGuire <[EMAIL PROTECTED]> wrote:
> > > On Aug 25, 4:57 am, mosscliffe <[EMAIL PROTECTED]> wrote:
> > > > I have 4 text files each approx 50mb.
> > >
> > > <yawn> 50mb? Really? Did you actually try this and find out it was a problem?
> > >
> > > Try this:
> > >
> > > import time
> > >
> > > start = time.clock()
> > > outname = "temp.dat"
> > > outfile = file(outname,"w")
> > > for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
> > >     infile = file(inname)
> > >     outfile.write( infile.read() )
> > >     infile.close()
> > > outfile.close()
> > > end = time.clock()
> > >
> > > print end-start, "seconds"
> > >
> > > For 4 30Mb files, this takes just over 1.3 seconds on my system. (You may need to open files in binary mode, depending on the contents, but I was in a hurry.)
> > >
> > > -- Paul
> >
> > My bad, my test file was not a text file, but a binary file. Retesting with a 50Mb text file took 24.6 seconds on my machine.
> >
> > Still in your working range? If not, then you will need to pursue more exotic approaches. But 25 seconds on an infrequent basis does not sound too bad, especially since I don't think you will really get any substantial boost from them (to benchmark this, I timed a raw "copy" command at the OS level of the resulting 200Mb file, and this took about 20 seconds).
> >
> > Keep it simple.
> >
> > -- Paul
>
> There are (at least) another couple of approaches possible, each with some possible tradeoffs or requirements:
>
> Approach 1. (Least amount of code to write - not that the others are large :)
>
> Just use os.system() and the UNIX cat command - the requirements here are that:
>
> a) your web site is hosted on *nix (ok, you can do it on Windows too - use copy instead of cat, you might have to add a "cmd /c " prefix in front of the copy command, and you have to use the right copy command syntax for concatenating multiple input files into one output file).
>
> b) your hosting plan allows you to execute OS-level commands like cat, and cat is in your OS PATH (not PYTHONPATH). (Similar comments apply for Windows hosts.)
>
> import os
> os.system("cat file1.txt file2.txt file3.txt file4.txt > file_out.txt")
>
> cat will take care of buffering, etc. transparently to you.
>
> Approach 2: Read (in a loop, as you originally thought of doing) each line of each of the 4 input files and write it to the output file:
>
> ("Reusing" Paul McGuire's code above:)
>
> import time
>
> start = time.clock()
> outname = "temp.dat"
> outfile = file(outname,"w")
> for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
>     infile = file(inname)
>     for lin in infile:
>         outfile.write(lin)
>     infile.close()
> outfile.close()
> end = time.clock()
>
> print end-start, "seconds"
>
> # You may need to check that newlines are not removed in the above code, in the output file. Can't remember right now. If they are, just add one back with:
>
> outfile.write(lin + "\n") instead of outfile.write(lin).
>
> (Code not tested, test it locally first, though looks ok to me.)
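
(As an aside: the same copy can be done in fixed-size chunks without writing the read loop by hand, using shutil.copyfileobj from the standard library. This is only a sketch, assuming Python 2 and the same placeholder file names as above; it keeps memory use small and handles binary files as well.)

import shutil

outfile = open("temp.dat", "wb")
for inname in ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']:
    infile = open(inname, "rb")
    # copy this input into the output in 64 KB blocks
    shutil.copyfileobj(infile, outfile, 64 * 1024)
    infile.close()
outfile.close()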
> The reason why this _may_ not be much slower than manually coded buffering approaches is that:
>
> a) Python's standard library is written in C (which is fast), including its use of stdio (the C standard I/O library, which already does intelligent buffering)
>
> b) OSes do I/O buffering anyway, and so do hard disk controllers
>
> c) since some recent Python version, I think it was 2.2, the idiom "for lin in infile" has been (based on something I read in the Python Cookbook) stated to be pretty efficient anyway (and yet (slightly) more readable than earlier approaches to reading a text file).
>
> Given all the above facts, it probably isn't worth your while to try and optimize the code unless and until you find (by measurements) that it's too slow - which is a good practice anyway:
>
> http://en.wikipedia.org/wiki/Optimization_(computer_science)
>
> Excerpt from the above page (it's long but worth reading, IMO):
>
> "Donald Knuth said, paraphrasing Hoare[1],
>
> "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Knuth, Donald. Structured Programming with go to Statements, ACM Journal Computing Surveys, Vol 6, No. 4, Dec. 1974. p. 268.)
>
> Charles Cook commented,
>
> "I agree with this. It's usually not worth spending a lot of time micro-optimizing code before it's obvious where the performance bottlenecks are. But, conversely, when designing software at a system level, performance issues should always be considered from the beginning. A good software developer will do this automatically, having developed a feel for where performance issues will cause problems. An inexperienced developer will not bother, misguidedly believing that a bit of fine tuning at a later stage will fix any problems." [2]
> "
>
> HTH
> Vasudev
> -----------------------------------------
> Vasudev Ram
> http://www.dancingbison.com
> http://jugad.livejournal.com
> http://sourceforge.net/projects/xtopdf
> -----------------------------------------
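
(If one does want to measure rather than guess, a rough timing harness along these lines will show whether the whole-file read or the line-by-line loop is faster on a given host. This is only a sketch: the file names are placeholders and it assumes Python 2, as used elsewhere in the thread.)

import time

INPUTS = ['file1.dat', 'file2.dat', 'file3.dat', 'file4.dat']   # placeholder names

def concat_whole(outname):
    # slurp each input completely, then write it out in one go
    outfile = file(outname, "w")
    for inname in INPUTS:
        infile = file(inname)
        outfile.write(infile.read())
        infile.close()
    outfile.close()

def concat_lines(outname):
    # copy line by line; iterating over a file preserves the newlines
    outfile = file(outname, "w")
    for inname in INPUTS:
        infile = file(inname)
        for lin in infile:
            outfile.write(lin)
        infile.close()
    outfile.close()

for func in (concat_whole, concat_lines):
    start = time.clock()
    func("temp_" + func.__name__ + ".dat")
    print func.__name__, ":", time.clock() - start, "seconds"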
All,

Thank you very much.

As my background is in much smaller memory machines than today's giants - 64k being a big machine and 640k being gigantic - I get very worried about crashing machines when copying or editing big files, especially in a multi-user environment.

Mr Knuth - that brings back memories. I remember implementing some of his sort routines on a mainframe with 24 tape units and an 8k drum, and almost eliminating one shift per day of computer operator time.

Thanks again

Richard
--
http://mail.python.org/mailman/listinfo/python-list