Thanks for your replies. Apologies for not including the right
information the first time around; more details are below.
First, I timed just the csv read on its own:

import time
import csv

afile = "largefile.txt"

t0 = time.clock()

print "working at file", afile
reader = csv.reader(open(afile, "r"), delimiter="\t")
for row in reader:
        x,y,z = row


t1 = time.clock()

print "finished: %.2f" % (t1 - t0)


$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.86


A tiny bit of background on the final application: this is biological
data from an affymetrix platform. The csv files are a chromosome name,
a coordinate and a data point, like this:

chr1    3754914 1.19828
chr1    3754950 1.56557
chr1    3754982 1.52371
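
To make the format concrete, here is a minimal sketch of how one such
line splits into typed fields. The helper name parse_affy_line is made
up for this example and is not part of either script; it assumes the
three fields are separated by single tabs, as in the real file.

```python
def parse_affy_line(line):
    """Split one tab-separated line into (chromosome, coordinate, value)."""
    # Assumption: exactly three tab-separated fields per line.
    chrom, coord, point = line.rstrip("\n").split("\t")
    return chrom, int(coord), float(point)


sample = "chr1\t3754914\t1.19828\n"
parsed = parse_affy_line(sample)
print(parsed)
```

The coordinate is an integer and the data point a float, which is why
they end up in two separate numpy arrays later on.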

In the "simple data structures" code below, I do some jiggery pokery
with the chromosome names to save me storing the same string millions
of times.
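
As a rough illustration of the idea only: each chromosome name is
swapped for a one-character code, so a whole column of names collapses
into one string with one character per row. The three-entry map, the
reversemap lookup and the chrom_for_row helper are cut down or made up
for this sketch; the real script uses the full map and writes the codes
into a cStringIO buffer instead of joining a list.

```python
# Hypothetical, cut-down version of the chromosome map (three entries only).
chrommap = {"chr1": "1", "chr10": "0", "chrX": "x"}

# Reverse lookup: one-character code back to the full chromosome name.
reversemap = dict((code, name) for name, code in chrommap.items())

# Stand-in for the chromosome column of the file, one entry per row.
rows = ["chr1", "chr1", "chr10", "chrX"]

# One character per row, so the chromosome of row i is codes[i].
codes = "".join(chrommap[name] for name in rows)


def chrom_for_row(i):
    """Recover the full chromosome name for row i."""
    return reversemap[codes[i]]


print(codes)
print(chrom_for_row(2))
```

Because the codes are exactly one character each, the position in the
string lines up with the index into the coords and points arrays.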


import csv
import cStringIO
import numpy
import time

afile = "largefile.txt"

chrommap = {'chrY': 'y', 'chrX': 'x', 'chr13': 'c',
                        'chr12': 'b', 'chr11': 'a', 'chr10': '0',
                        'chr17': 'g', 'chr16': 'f', 'chr15': 'e',
                        'chr14': 'd', 'chr19': 'i', 'chr18': 'h',
                        'chrM': 'm', 'chr22': 'l', 'chr20': 'j',
                        'chr21': 'k', 'chr7': '7', 'chr6': '6',
                        'chr5': '5', 'chr4': '4', 'chr3': '3',
                        'chr2': '2', 'chr1': '1', 'chr9': '9', 'chr8': '8'}


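def getFileLength(fh):
        # Count the lines up front so the numpy arrays can be preallocated.
        # Note: this reads the whole file into memory once.
        wholefile = fh.read()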
        numlines = wholefile.count("\n")
        fh.seek(0)
        return numlines

count = 0
print "reading affy file", afile
fh = open(afile)
n = getFileLength(fh)
chromio = cStringIO.StringIO()
coords = numpy.zeros(n, dtype=int)
points = numpy.zeros(n)

t0 = time.clock()
reader = csv.reader(fh, delimiter="\t")
for row in reader:
        if not row:
                continue
        chrom, coord, point = row
        mappedc = chrommap[chrom]
        chromio.write(mappedc)
        coords[count] = coord
        points[count] = point
        count += 1
t1 = time.clock()

print "finished: %.2f" % (t1 - t0)


$ ./affyspeedtest.py
reading affy file largefile.txt
finished: 15.54


Thanks again (tugs forelock),

Peter
--
http://mail.python.org/mailman/listinfo/python-list
