grocery_stocker wrote:

> On Apr 27, 5:15 am, Peter Otten <__pete...@web.de> wrote:
>> psaff...@googlemail.com wrote:
>> > I'm using the CSV library to process a large amount of data - 28
>> > files, each of 130MB. Just reading in the data from one file and
>> > filing it into very simple data structures (numpy arrays and a
>> > cStringIO) takes around 10 seconds. If I just slurp one file into a
>> > string, it only takes about a second, so I/O is not the bottleneck.
>> > Is it really taking 9 seconds just to split the lines and set the
>> > variables?
>>
>> > Is there some way I can improve the CSV performance?
>>
>> My ideas:
>>
>> (1) Disable cyclic garbage collection while you read the file into
>> your data structure:
>>
>> import gc
>>
>> gc.disable()
>> # create many small objects that you want to keep
>> gc.enable()
>>
>> (2) If your data contains only numerical data without quotes use
>>
>> numpy.fromfile()
>
> How would disabling the cyclic garbage collection make it go faster in
> this case?
When Python creates many objects and doesn't release any of them, the
garbage collector assumes that they are being kept alive by cyclic
references. When you know that you actually want to keep all those objects
you can temporarily disable garbage collection. E.g.:

$ cat gcdemo.py
import time
import sys
import gc

def main(float=float):
    if "-d" in sys.argv:
        gc.disable()
        status = "disabled"
    else:
        status = "enabled"
    all = []
    append = all.append
    start = time.time()
    floats = ["1.234"] * 10
    assert len(set(map(id, map(float, floats)))) == len(floats)
    for _ in xrange(10**6):
        append(map(float, floats))
    print time.time() - start, "(garbage collection %s)" % status

main()

$ python gcdemo.py -d
11.6144971848 (garbage collection disabled)
$ python gcdemo.py
15.5317759514 (garbage collection enabled)

Of course I don't know whether this is actually a problem for the OP's
code.

Peter
--
http://mail.python.org/mailman/listinfo/python-list
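
Applied to the OP's problem, the same trick just brackets the CSV-reading
loop. A minimal sketch, assuming Python 2 and a plain csv.reader pass over
one of the 130MB files (the file name and what happens to each row are made
up); the try/finally makes sure collection is re-enabled even if parsing
blows up halfway through:

import csv
import gc

def read_file(path):
    # Keep every parsed row; with cyclic GC disabled, Python doesn't
    # repeatedly traverse the growing pile of row lists while we read.
    rows = []
    gc.disable()
    try:
        f = open(path, "rb")
        try:
            for row in csv.reader(f):
                rows.append(row)
        finally:
            f.close()
    finally:
        gc.enable()
    return rows

rows = read_file("data.csv")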
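
As for suggestion (2) above: numpy.fromfile() with a text separator always
returns a flat 1-D array. A closely related call that understands rows is
numpy.loadtxt(); whether it beats the csv module is worth timing on the
real data. A sketch under the same assumptions (purely numeric, unquoted
fields; the path is made up):

import numpy

# Hypothetical 130MB file of unquoted numeric CSV.
PATH = "data.csv"

# loadtxt() parses delimited text into a 2-D float array in one call,
# skipping the csv module entirely.
table = numpy.loadtxt(PATH, delimiter=",")
print table.shape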