Re: CSV performance

2009-04-29 Thread psaff...@googlemail.com
> rows = fh.read().split()
> coords = numpy.array(map(int, rows[1::3]), dtype=int)
> points = numpy.array(map(float, rows[2::3]), dtype=float)
> chromio.writelines(map(chrommap.__getitem__, rows[::3]))

My original version is about 15 seconds. This version is about 9. The chunks version posted
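For context, a self-contained version of the quoted slicing approach. It assumes a whitespace-separated file with three columns per row (chromosome name, integer coordinate, float value); the chrommap contents and the file name are placeholders, not taken from the thread.

    # Sketch of the split-and-slice approach quoted above.
    # Assumptions (not from the thread): three whitespace-separated columns
    # per row -- chromosome name, int coordinate, float value -- and
    # chrommap/"largefile.txt" are placeholder names.
    import numpy
    from cStringIO import StringIO

    chrommap = {'chr1': 'a', 'chr2': 'b', 'chrX': 'x', 'chrY': 'y'}  # hypothetical mapping
    chromio = StringIO()

    fh = open("largefile.txt")
    rows = fh.read().split()                  # one flat list: name, coord, value, name, ...
    coords = numpy.array(map(int, rows[1::3]), dtype=int)      # every 3rd item from index 1
    points = numpy.array(map(float, rows[2::3]), dtype=float)  # every 3rd item from index 2
    chromio.writelines(map(chrommap.__getitem__, rows[::3]))   # translate chromosome names
    fh.close()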

Re: CSV performance

2009-04-29 Thread Lawrence D'Oliveiro
In message , Jorgen Grahn wrote:

> I am asking because people who like databases tend to overestimate the
> time it takes to parse text.

And those of us who regularly load databases from text files, or unload them in the opposite direction, have a good idea of EXACTLY how long it takes to parse text.

Re: CSV performance

2009-04-29 Thread Jorgen Grahn
On Mon, 27 Apr 2009 23:56:47 +0200, dean wrote:

> On Mon, 27 Apr 2009 04:22:24 -0700 (PDT), psaff...@googlemail.com wrote:
>
>> I'm using the CSV library to process a large amount of data - 28
>> files, each of 130MB. Just reading in the data from one file and
>> filing it into very simple data structures

Re: CSV performance

2009-04-29 Thread Lawrence D'Oliveiro
In message , Peter Otten wrote:

> When I see the sequence
>
> save state
> change state
> do something
> restore state
>
> I feel compelled to throw in a try ... finally

Yeah, but I try to avoid using exceptions to that extent. :)

--
http://mail.python.org/mailman/listinfo/python-list
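A minimal sketch of the pattern being debated here, with the collector state saved up front and restored in a finally clause; the object-creation line is only a stand-in for the real loading work.

    # Remember the GC state, disable it, and restore it even if loading raises.
    import gc

    was_enabled = gc.isenabled()
    gc.disable()
    try:
        data = [[i] for i in xrange(1000000)]   # stand-in for "create many small objects"
    finally:
        if was_enabled:
            gc.enable()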

Re: CSV performance

2009-04-29 Thread Peter Otten
Lawrence D'Oliveiro wrote:

> In message , Peter Otten wrote:
>
>> gc.disable()
>> # create many small objects that you want to keep
>> gc.enable()
>
> Every time I see something like this, I feel the urge to save the previous
> state and restore it afterwards:
>
> save_enabled = gc.isenabled()

Re: CSV performance

2009-04-28 Thread Lawrence D'Oliveiro
In message , Peter Otten wrote:

> gc.disable()
> # create many small objects that you want to keep
> gc.enable()

Every time I see something like this, I feel the urge to save the previous state and restore it afterwards:

save_enabled = gc.isenabled()
gc.disable()
# create many small objects that you want to keep

Re: CSV performance

2009-04-27 Thread dean
On Mon, 27 Apr 2009 04:22:24 -0700 (PDT), psaff...@googlemail.com wrote:

> I'm using the CSV library to process a large amount of data - 28
> files, each of 130MB. Just reading in the data from one file and
> filing it into very simple data structures (numpy arrays and a
> cstringio) takes around 10 seconds.

Re: CSV performance

2009-04-27 Thread Scott David Daniels
psaff...@googlemail.com wrote:

Thanks for your replies. Many apologies for not including the right information first time around. More information is below.

Here is another way to try (untested):

import numpy
import time

chrommap = dict(chrY='y', chrX='x', chr13='c', chr12='b', chr11='a',

Re: CSV performance

2009-04-27 Thread Peter Otten
psaff...@googlemail.com wrote:

> Thanks for your replies. Many apologies for not including the right
> information first time around. More information is below.
>
> I have tried running it just on the csv read:
> $ ./largefilespeedtest.py
> working at file largefile.txt
> finished: 3.86.2

Re: CSV performance

2009-04-27 Thread Tim Chase
I have tried running it just on the csv read:
...
print "finished: %f.2" % (t1 - t0)

I presume you wanted "%.2f" here. :)

$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.86.2

So just the CSV processing of the file takes just shy of 4 seconds and you said that just
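For reference, the difference between the two format strings in an interactive session (the value is just an example):

    >>> "finished: %f.2" % 3.86    # "%f" formats the float, then ".2" is literal text
    'finished: 3.860000.2'
    >>> "finished: %.2f" % 3.86    # two decimal places, as intended
    'finished: 3.86'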

Re: CSV performance

2009-04-27 Thread Peter Otten
>> > lines and set the
>> > variables?
>>
>> > Is there some way I can improve the CSV performance?
>>
>> My ideas:
>>
>> (1) Disable cyclic garbage collection while you read the file into your
>> data structure:
>>
>> import gc

Re: CSV performance

2009-04-27 Thread grocery_stocker
> > data structures (numpy arrays and a
> > cstringio) takes around 10 seconds. If I just slurp one file into a
> > string, it only takes about a second, so I/O is not the bottleneck. Is
> > it really taking 9 seconds just to split the lines and set the
> > variables?

Re: CSV performance

2009-04-27 Thread psaff...@googlemail.com
Thanks for your replies. Many apologies for not including the right information first time around. More information is below.

I have tried running it just on the csv read:

import time
import csv

afile = "largefile.txt"

t0 = time.clock()
print "working at file", afile
reader = csv.reader(open(afile))
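A hedged reconstruction of the rest of this test script: only the lines shown above are confirmed by the thread; the iteration loop and the %.2f fix suggested by Tim are assumptions.

    # Reconstructed timing test: iterate the reader without building any data
    # structures, then report the elapsed time.
    import time
    import csv

    afile = "largefile.txt"

    t0 = time.clock()
    print "working at file", afile
    reader = csv.reader(open(afile))
    for row in reader:
        pass                      # just exercise the csv parsing
    t1 = time.clock()
    print "finished: %.2f" % (t1 - t0)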

Re: CSV performance

2009-04-27 Thread Tim Chase
I'm using the CSV library to process a large amount of data - 28 files, each of 130MB. Just reading in the data from one file and filing it into very simple data structures (numpy arrays and a cstringio) takes around 10 seconds. If I just slurp one file into a string, it only takes about a second,

Re: CSV performance

2009-04-27 Thread Peter Otten
> slurp one file into a
> string, it only takes about a second, so I/O is not the bottleneck. Is
> it really taking 9 seconds just to split the lines and set the
> variables?
>
> Is there some way I can improve the CSV performance?

My ideas:

(1) Disable cyclic garbage collection while you read the file into your data structure:
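A short sketch of suggestion (1) wrapped around the CSV loop; the file name and the plain list used to hold rows are placeholders:

    # Switch off cyclic GC while the reader creates lots of small objects,
    # then switch it back on. "largefile.txt" and the list are placeholders
    # for the real file and data structures.
    import csv
    import gc

    gc.disable()
    rows = []
    for row in csv.reader(open("largefile.txt")):
        rows.append(row)          # many small lists/strings created here
    gc.enable()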

Re: CSV performance

2009-04-27 Thread John Machin
it's a problem with the csv module and not with the "filing it into very simple data structures"? How long does it take just to read the CSV file, i.e. without setting any of the variables? Have you run your timing tests multiple times and discarded the first one or two results?

> Is there
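A sketch of the measurement John suggests: time only the bare csv iteration, repeat it, and ignore the first (cold) run. The file name and the number of runs are arbitrary choices for illustration.

    # Time the raw csv read a few times and discard the first (cold-cache) run.
    import csv
    import time

    timings = []
    for run in range(4):
        t0 = time.clock()
        for row in csv.reader(open("largefile.txt")):
            pass                  # no data structures built, just parsing
        timings.append(time.clock() - t0)

    print "all runs:  ", ", ".join("%.2f" % t for t in timings)
    print "warm best: %.2f" % min(timings[1:])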

CSV performance

2009-04-27 Thread psaff...@googlemail.com
second, so I/O is not the bottleneck. Is it really taking 9 seconds just to split the lines and set the variables?

Is there some way I can improve the CSV performance? Is there a way I can slurp the file into memory and read it like a file from there?

Peter

--
http://mail.python.org/mailman/listinfo/python-list
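One way to try the slurp-then-parse idea from this post: read the whole file into a string and hand a cStringIO buffer to csv.reader. A sketch with an assumed file name; it front-loads the I/O rather than guaranteeing faster parsing.

    # Slurp the file into memory first, then let csv parse from the in-memory buffer.
    import csv
    from cStringIO import StringIO

    data = open("largefile.txt").read()      # all I/O happens here, up front
    for row in csv.reader(StringIO(data)):   # csv now parses from memory
        pass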