I have tried running it just on the csv read:
...
print "finished: %f.2" % (t1 - t0)

I presume you wanted "%.2f" here. :)
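
To make the difference concrete, here is a small sketch of the two format strings side by side. The value of "elapsed" is made up for illustration and just stands in for your (t1 - t0); it is written so it should run under either Python 2 or 3.

```python
# "elapsed" is a hypothetical stand-in for (t1 - t0) in the original script.
elapsed = 3.86

# "%f" formats with six decimals by default, and the ".2" that follows
# is then printed as literal text.
wrong = "finished: %f.2" % elapsed

# The precision belongs between the "%" and the "f".
right = "finished: %.2f" % elapsed

print(wrong)   # finished: 3.860000.2
print(right)   # finished: 3.86
```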

$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.860000.2

So the CSV read of the file takes just shy of 4 seconds, and you said that the pure file-read took about a second, so that leaves about 3 seconds for CSV processing (or about 1/3 of the total runtime). In the code example in your 2nd post (the one with the timing in it), it looks like the whole run took 15+ seconds, meaning the csv code is a mere 1/5 of the runtime. I also notice that you're reading the file once to find the length, and then reading it again to process it.

The csv files are a chromosome name,
a coordinate and a data point, like this:

chr1    3754914 1.19828
chr1    3754950 1.56557
chr1    3754982 1.52371

Depending on the simplicity of the file-format (assuming nothing like spaces/tabs in the chromosome name, which your dictionary seems to indicate is the case), it may be faster to use .split() to do the work:

  for line in open(afile):
      a, b, c = line.rstrip('\n\r').split()

The csv module does a lot of smart stuff that it looks like you may not need.
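
Fleshed out a little, the split()-based approach might look like the sketch below. The name read_rows is my own invention, and I'm assuming every line has exactly three whitespace-separated fields with an integer coordinate and a float data point, as in your sample. Note that a bare split() already discards the trailing newline, so the rstrip() isn't strictly needed.

```python
def read_rows(path):
    """Yield (chrom, coord, point) tuples from a whitespace-delimited file.

    Hypothetical helper: assumes exactly three fields per line and no
    whitespace inside the chromosome name.
    """
    with open(path) as f:
        for line in f:
            # split() with no argument splits on any run of spaces/tabs
            # and ignores leading/trailing whitespace, including "\n".
            chrom, coord, point = line.split()
            yield chrom, int(coord), float(point)
```

Something like "for chrom, coord, point in read_rows('largefile.txt'):" would then replace the csv.reader loop. A line with more or fewer than three fields raises a ValueError, which is probably what you want for a sanity check.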

However, you're still only cutting from that 3-second subset of your total time. Focusing on the "filing it into very simple data structures" will likely net you greater improvements. I don't have much experience with numpy, so I can't offer much to help. However, rather than reading the file twice, you might try a general heuristic, assuming lines are no shorter than N characters (they look like they're each 20 chars + a newline) and then using "filesize/N" to estimate an adequately sized array. Using stat() on a file to get its size will be a heckuva lot faster than reading the whole file.

I also don't know the performance of cStringIO.StringIO() with lots of appending. However, since each write is just a character, you might do well to use the array module (unless numpy also has char-arrays) to preallocate n chars just like you do with your ints and floats:

  chromeio[count] = chrommap[chrom]
  coords[count] = coord
  points[count] = point
  count += 1
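
Putting the stat() estimate and the preallocated arrays together, a rough sketch might look like this. It uses only the stdlib array module rather than numpy (which I don't know well), and the names estimate_rows, load and min_line_bytes are all made up for illustration. The estimate is only an upper bound if no line is shorter than min_line_bytes, so the sketch doubles the arrays as a fallback if that guess turns out to be wrong.

```python
import os
from array import array

def estimate_rows(path, min_line_bytes=21):
    """Guess the row count from the file size alone (no read needed).

    Only an upper bound if no line is shorter than min_line_bytes.
    """
    return os.stat(path).st_size // min_line_bytes + 1

def load(path, chrommap, min_line_bytes=21):
    """Fill preallocated arrays in one pass; chrommap maps names to small ints."""
    n = estimate_rows(path, min_line_bytes)
    chroms = array('b', [0]) * n      # one signed byte per row
    coords = array('l', [0]) * n      # C long
    points = array('d', [0.0]) * n    # C double
    count = 0
    with open(path) as f:
        for line in f:
            chrom, coord, point = line.split()
            if count == n:
                # The estimate was too small: double all three arrays.
                chroms.extend(array('b', [0]) * n)
                coords.extend(array('l', [0]) * n)
                points.extend(array('d', [0.0]) * n)
                n *= 2
            chroms[count] = chrommap[chrom]
            coords[count] = int(coord)
            points[count] = float(point)
            count += 1
    # Trim the unused tail left over from the estimate.
    return chroms[:count], coords[:count], points[:count]
```

With 21 bytes per line as the guess, the arrays come out a bit oversized whenever lines are longer (e.g. "chr10" rows), which costs a little memory but saves the second pass over the file.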

Just a few ideas to try.

-tkc





--
http://mail.python.org/mailman/listinfo/python-list
