Object cleanup
I am writing a screen scraping application using BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/ (which is fantastic, by the way). I have an object with two methods, each of which loads an HTML document and scrapes out some information, putting strings from the HTML documents into lists and dictionaries. I have a set of these objects from which I am aggregating and returning data. With a large number of these objects, the memory footprint is very large.

The "soup" object is a local variable in each scraping method, so I assumed it would be cleaned up after the method had returned. However, using guppy I've found that after the methods have returned, most of the memory is still taken up by BeautifulSoup objects of one type or another. I'm not declaring BeautifulSoup objects anywhere else. I've tried assigning None to the "soup" objects at the end of the method calls and calling garbage collection manually, but this doesn't seem to help.

I'd like to find out exactly what object "owns" the various BeautifulSoup structures, but I'm quite a new guppy user and I can't figure out how to do this. How do I force the memory for these soup objects to be freed? Is there anything else I should be looking at to find out the cause of these problems?

Peter
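For reference, here is roughly the shape of each scraping method, reduced to a sketch - the URL handling and tag names are made up, and the decompose() call is something I have only read about (it exists in newer BeautifulSoup releases and is supposed to break the parent/child/next/previous links inside the parse tree), not something I have verified:

import gc
import urllib2

from BeautifulSoup import BeautifulSoup   # or: from bs4 import BeautifulSoup

def scrape_titles(url):
    """Hypothetical scraping method: pull some strings out of a page."""
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Copy the data out as plain strings, so nothing we keep refers back
    # into the parse tree.
    titles = [str(tag.string) for tag in soup.findAll("h2") if tag.string]
    # Where available, decompose() breaks the internal reference cycles so
    # they don't have to wait for the cycle collector.
    if hasattr(soup, "decompose"):
        soup.decompose()
    del soup
    gc.collect()
    return titles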
Re: Object cleanup
Thanks for all the responses. It looks like none of the BeautifulSoup objects have __del__ methods, so I don't think that can be the problem. To answer your other question, guppy was the best match I came up with when looking for a memory profiler for Python (or more specifically "Heapy"): http://guppy-pe.sourceforge.net/#Heapy

On Thursday, May 31, 2012 2:51:52 AM UTC+1, Steven D'Aprano wrote:
>
> The destructor doesn't get called until the last reference is gone.
>

That makes sense, so now I need to track down why there are references to the object when I don't think there should be. Are there any systematic methods for doing this?

Peter
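One thing I'm planning to try is the standard library's gc module, which can list the objects that still refer to a given object - something like this sketch, where sample_tag is just a stand-in for whichever leaked object heapy reports:

import gc
import types

def show_owners(obj, depth=2, indent=0):
    """Print what refers to 'obj', a couple of levels deep (sketch only)."""
    if depth == 0:
        return
    for referrer in gc.get_referrers(obj):
        # Stack frames show up because they hold this function's own local
        # variables; they are not interesting owners, so skip them.
        if isinstance(referrer, types.FrameType):
            continue
        print "  " * indent + repr(type(referrer))
        show_owners(referrer, depth - 1, indent + 1)

# sample_tag would be one of the leaked BeautifulSoup objects found via heapy:
# show_owners(sample_tag)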
Re: mod_python: delay in files changing after alteration
On 6 Jan, 23:31, Graham Dumpleton wrote:
> Thus, any changes to modules/packages installed on sys.path require a
> full restart of Apache to ensure they are loaded by all Apache child
> worker processes.

That will be it. I'm pulling in some libraries of my own from elsewhere, which are still being modified to accommodate the web app. These are the changes that are causing the problems. An Apache restart isn't too onerous - I'll just start doing that.

Thanks,
Peter
subprocess.Popen stalls
I'm building a bioinformatics application using the ipcress tool: http://www.ebi.ac.uk/~guy/exonerate/ipcress.man.html

I'm using subprocess.Popen to execute ipcress, which takes a group of files full of DNA sequences and returns some analysis on them. Here's a code fragment:

cmd = "/usr/bin/ipcress ipcresstmp.txt --sequence /home/pzs/genebuilds/human/*.fasta"
print "checking with ipcress using command", cmd
p = Popen(cmd, shell=True, bufsize=100, stdout=PIPE, stderr=PIPE)
retcode = p.wait()
if retcode != 0:
    print "ipcress failed with error code:", retcode
    raise Exception
output = p.stdout.read()

If I run the command at my shell, it finishes successfully. It takes 30 seconds - it uses 100% of one core and several hundred MB of memory during this time. The output is 220KB of text. However, running it through Python as per the above code, it stalls after 5 seconds, not using any processor at all. I've tried leaving it for a few minutes with no change. If I interrupt it, it's at the "retcode = p.wait()" line. I've tried making the bufsize really large and that doesn't seem to help.

I'm a bit stuck - any suggestions? This same command has worked fine on other ipcress runs. This one might generate more output than the others, but 220KB isn't that much, is it?

Peter
Re: subprocess.Popen stalls
On 12 Jan, 15:33, mk wrote:
> Better use communicate() method:

Oh yes - it's right there in the documentation. That worked perfectly.

Many thanks,
Peter
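For anyone finding this thread later, the fix amounts to replacing the wait()/read() pair with communicate(), which drains stdout and stderr while the child runs and so avoids the child blocking on a full pipe buffer; a sketch along those lines (the command string is the one from the original post):

from subprocess import Popen, PIPE

cmd = "/usr/bin/ipcress ipcresstmp.txt --sequence /home/pzs/genebuilds/human/*.fasta"
p = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)

# communicate() reads both pipes to EOF and then waits for the process
# to exit, so the 220KB of output never fills the pipe.
output, errors = p.communicate()

if p.returncode != 0:
    print "ipcress failed with error code:", p.returncode
    print errors
    raise Exception("ipcress failed")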
Which core am I running on?
Is there some way I can get at this information at run-time? I'd like to use it to tag diagnostic output dumped during runs using Parallel Python.

Peter
Re: Which core am I running on?
On 9 Feb, 12:24, Gerhard Häring wrote:
> Looks like I have answered a similar question once, btw. ;-)

Ah, yes - thanks. I did Google for it, but obviously didn't have the right search term.

Cheers,
Peter
Too many open files
I'm building a pipeline involving a number of shell tools. In each case, I create a temporary file using tempfile.mkstemp() and invoke a command ("cmd < /tmp/tmpfile") on it using subprocess.Popen. At the end of each section, I call close() on the file handles and use os.remove() to delete them. Even so, I build up file descriptors that eventually give me the "too many open files" error message. Am I missing something here?

Peter
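A sketch of the per-file pattern under discussion (the command and filenames are made up). The detail that is easy to miss is that mkstemp() returns an already-open OS-level descriptor as the first element of its tuple; if that integer is never closed - for example because the path is reopened with open() and only that new file object gets close()d - every temporary file leaks one descriptor even though os.remove() unlinks the file itself:

import os
import tempfile
from subprocess import Popen, PIPE

def run_tool(data, cmd="sometool"):
    # mkstemp() hands back an already-open OS-level descriptor plus a path.
    fd, path = tempfile.mkstemp()
    try:
        # Wrap the descriptor rather than calling open(path, "w"): reopening
        # the path would leave the original fd dangling.
        handle = os.fdopen(fd, "w")
        handle.write(data)
        handle.close()            # closing the wrapper closes fd as well
        p = Popen("%s < %s" % (cmd, path), shell=True, stdout=PIPE, stderr=PIPE)
        output, _ = p.communicate()
        return output
    finally:
        os.remove(path)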
Re: Which core am I running on?
On 9 Feb, 12:24, Gerhard Häring wrote:
> http://objectmix.com/python/631346-parallel-python.html

Hmm. In fact, this doesn't seem to work for pp. When I run the code below, it says everything is running on the one core.

import pp
import random
import time
from string import lowercase

ncpus = 3

def timedCharDump(waittime, char):
    time.sleep(waittime)
    mycore = open("/proc/%i/stat" % os.getpid()).read().split()[39]
    print "I'm doing stuff!", mycore, char
    return char

job_server = pp.Server(ncpus, ppservers=())
jobdetails = [ (random.random(), letter) for letter in lowercase ]
jobs = [ job_server.submit(timedCharDump, (jinput1, jinput2), (), ("os", "time",)) for jinput1, jinput2 in jobdetails ]
for job in jobs:
    print job()

Peter
Selecting a different superclass
This might be a pure OO question, but I'm doing it in Python so I'll ask here. I'm writing a number crunching bioinformatics application. Read lots of numbers from files; merge, median and munge; draw plots. I've found that the most critical part of this work is validation and traceability - "where does this final value come from? How has it been combined with other values? Is that right?"

My current implementation stores all my values just as floats, with a class called PointSet for storing one set of such values, with various mathematical and statistical methods. There are several subclasses of PointSet (IDPointSet, MicroArrayPointSet) for obtaining values from different file types and with different processing pipelines.

I'm planning to instead store each value in a TraceablePoint class, which has members that describe the processing stages the value has undergone, and a TraceablePointSet class to store groups of these - this will contain all the same methods as PointSet, but will operate on TraceablePoints instead of floats. Of course, this will be much slower than just floats, so I'd like to be able to switch it on and off. The problem is that IDPointSet and MicroArrayPointSet will need to inherit from PointSet or TraceablePointSet based on whether I'm handling traceable points or not. Can I select a superclass conditionally like this in Python? Am I trying to do something really evil here?

Any other bright ideas on my application are also welcome.

Peter
Re: Selecting a different superclass
On 17 Dec, 20:33, "Chris Rebert" wrote:
> superclass = TraceablePointSet if tracing else PointSet

Perfect - many thanks. Good to know I'm absolved from evil, also ;)

Peter
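Spelled out a little more, the approach amounts to computing the base class before the class statements run; a sketch with stub classes standing in for the real PointSet and TraceablePointSet:

tracing = True   # e.g. read from a config file or command-line flag

class PointSet(object):
    kind = "plain floats"

class TraceablePointSet(PointSet):
    kind = "traceable points"

# Evaluated once, when the module is executed, so every subclass below
# picks up the same base depending on the flag.
superclass = TraceablePointSet if tracing else PointSet

class IDPointSet(superclass):
    pass

class MicroArrayPointSet(superclass):
    pass

print IDPointSet.__bases__        # (<class '__main__.TraceablePointSet'>,)
print MicroArrayPointSet().kind   # 'traceable points'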
mod_python: delay in files changing after alteration
Maybe this is an apache question, in which case apologies. I am running mod_python 3.3.1-3 on apache 2.2.9-7. It works fine, but I find that when I alter a source file during development, it sometimes takes 5 seconds or so for the changes to be seen. This might sound trivial, but when debugging tens of silly errors, it's annoying that I have to keep hitting refresh on my browser waiting for the change to "take". I'm guessing this is just a caching issue of some kind, but can't figure out how to switch it off. Any suggestions?

The entry in my apache2.conf looks like this:

SetHandler mod_python
PythonHandler mod_python.publisher
PythonDebug On

Thanks,
Peter
Memory efficient tuple storage
I'm reading in some rather large files (28 files, each of 130MB). Each line is a genome coordinate (chromosome (string) and position (int)) and a data point (float). I want to read these into a list of coordinates (each a tuple of (chromosome, position)) and a list of data points.

This has taught me that Python lists are not memory efficient: if I use lists, the program eats memory at 100MB a second until it hits swap space - and I have 8GB of physical memory in this machine. I can use Python arrays or numpy arrays for the data points, which is much more manageable. However, I still need the coordinates. If I don't keep them in a list, where can I keep them?

Peter
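One option worth sketching (an illustration, not tested on the real data) is a single numpy structured array: preallocate one fixed-width record per line and fill it row by row, rather than building a list of tuples first, so the only per-row Python objects are transient. The field names and widths below are illustrative - a 6-character chromosome string, a 4-byte position and an 8-byte float:

import numpy as np

row_type = np.dtype([("chromo", "S6"),      # e.g. "chr13"
                     ("position", np.int32),
                     ("dpoint", np.float64)])

def load_file(path, nlines):
    # One fixed-size record per line: 6 + 4 + 8 = 18 bytes each,
    # so roughly 100MB for about six million lines.
    table = np.zeros(nlines, dtype=row_type)
    fh = open(path)
    for i, line in enumerate(fh):
        chromo, position, dpoint = line.split("\t")
        table[i] = (chromo, int(position), float(dpoint))
    fh.close()
    return table

# table["chromo"], table["position"] and table["dpoint"] are then views
# onto the packed data, usable like ordinary numpy arrays.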
Re: Memory efficient tuple storage
Thanks for all the replies.

First of all, can anybody recommend a good way to show memory usage? I tried heapy, but couldn't make much sense of the output and it didn't seem to change much for different usages. Maybe I was just making the h.heap() call in the wrong place. I also tried getrusage() in the resource module. That seemed to give 0 for the shared and unshared memory size no matter what I did. I was calling it after the function call that filled up the lists. The memory figures I give in this message come from top.

The numpy solution does work, but it uses more than 1GB of memory for one of my 130MB files. I'm using

np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6', 'i4', 'f8']})

so shouldn't it use 18 bytes per line? The file has 5832443 lines, which by my arithmetic is around 100MB...?

My previous solution - using a python array for the numbers and a list of tuples for the coordinates - uses about 900MB. The dictionary solution suggested by Tim got this down to 650MB. If I just ignore the coordinates, this comes down to less than 100MB. I feel sure the list mechanics for storing the coordinates are what is killing me here.

As to "work smarter", you could be right, but it's tricky. The 28 files are in 4 groups of 7, so given that each file is about 6 million lines, each group of data points contains about 42 million points. First, I need to divide every point by the median of its group. Then I need to z-score the whole group of points. After this preparation, I need to file each point, based on its coordinates, into other data structures - the genome itself is divided up into bins that cover a range of coordinates, and we file each point into the appropriate bin for the coordinate region it overlaps. Then there are operations that combine the values from various bins. The relevant coordinates for these combinations come from more enormous csv files.

I've already done all this analysis on smaller datasets, so I'm hoping I won't have to make huge changes just to fit the data into memory. Yes, I'm also finding out how much it will cost to upgrade to 32GB of memory :)

Sorry for the long message...

Peter
Re: Memory efficient tuple storage
In the end, I used a cStringIO object to store the chromosomes - because there are only 23, I can use one character for each chromosome and represent the whole lot with a giant string and a dictionary to say what each character means. Then I used numpy arrays for the data and coordinates. This squeezed each file into under 100MB.

Thanks again for the help!

Peter
Parallel processing on shared data structures
I'm filing 160 million data points into a set of bins based on their position. At the moment, this takes just over an hour using interval trees. I would like to parallelise this to take advantage of my quad core machine.

I have some experience of Parallel Python, but PP seems to only really work for problems where you can do one discrete bit of processing and recombine the results at the end. I guess I could thread my code and use mutexes to protect the shared lists that everybody is filing into. However, my understanding is that Python threads still only run in one process, so this won't give me multi-core.

Does anybody have any suggestions for this?

Peter
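One pattern that fits this shape (a sketch only, assuming the binning of one point does not depend on any other point) is to let each worker process bin a disjoint chunk of the points and merge the per-worker results at the end, so nothing is shared while the workers run; the fixed-width binning below is a placeholder for the real interval-tree lookup:

from multiprocessing import Pool
from collections import defaultdict

BIN_SIZE = 100000   # placeholder: width of each genome bin

def bin_chunk(points):
    """Bin one chunk of (position, value) pairs; runs in a worker process."""
    bins = defaultdict(list)
    for position, value in points:
        bins[position // BIN_SIZE].append(value)
    return bins

def merge(results):
    merged = defaultdict(list)
    for bins in results:
        for key, values in bins.items():
            merged[key].extend(values)
    return merged

if __name__ == "__main__":
    points = [(i * 37, float(i)) for i in xrange(100000)]   # stand-in data
    nworkers = 4
    chunksize = len(points) // nworkers + 1
    chunks = [points[i:i + chunksize] for i in xrange(0, len(points), chunksize)]
    pool = Pool(nworkers)
    merged = merge(pool.map(bin_chunk, chunks))
    print len(merged), "bins filled"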
mod_python form upload: permission denied sometimes...
I have a mod_python application that takes a POST file upload from a form. It works fine from my machine, other machines in my office and my home machine. It does not work from my boss's machine in a different city - he gets "You don't have permission to access this on this server". In the logs, it's returned 403. I also have this error in error.log:

Cannot traverse upload in /pythonapps/wiggle/form/upload because is not a traversable object, referer: ...

Could this be a network level problem? If so, why does it work from my home machine but not my boss's machine? The file to upload is quite large - 7MB.
CSV performance
I'm using the CSV library to process a large amount of data - 28 files, each of 130MB. Just reading in the data from one file and filing it into very simple data structures (numpy arrays and a cStringIO) takes around 10 seconds. If I just slurp one file into a string, it only takes about a second, so I/O is not the bottleneck. Is it really taking 9 seconds just to split the lines and set the variables?

Is there some way I can improve the CSV performance? Is there a way I can slurp the file into memory and read it like a file from there?

Peter
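What I mean by the last question is something like this sketch - read the whole file into a string, wrap it in cStringIO (which csv.reader accepts like any other file object) and parse out of memory; whether that actually helps is exactly what I'm trying to find out:

import csv
import cStringIO

afile = "largefile.txt"

# Slurp the file in one go, then let csv parse from the in-memory copy.
data = open(afile).read()
reader = csv.reader(cStringIO.StringIO(data), delimiter="\t")
for row in reader:
    chrom, coord, point = row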
Re: CSV performance
Thanks for your replies. Many apologies for not including the right information first time around. More information is below.

I have tried running it just on the csv read:

import time
import csv

afile = "largefile.txt"

t0 = time.clock()
print "working at file", afile
reader = csv.reader(open(afile, "r"), delimiter="\t")
for row in reader:
    x,y,z = row
t1 = time.clock()
print "finished: %f.2" % (t1 - t0)

$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.86.2

A tiny bit of background on the final application: this is biological data from an affymetrix platform. The csv files are a chromosome name, a coordinate and a data point, tab separated, like this:

chr1	3754914	1.19828
chr1	3754950	1.56557
chr1	3754982	1.52371

In the "simple data structures" code below, I do some jiggery pokery with the chromosome names to save me storing the same string millions of times.

import csv
import cStringIO
import numpy
import time

afile = "largefile.txt"

chrommap = {'chrY': 'y', 'chrX': 'x', 'chr13': 'c', 'chr12': 'b', 'chr11': 'a',
'chr10': '0', 'chr17': 'g', 'chr16': 'f', 'chr15': 'e', 'chr14': 'd',
'chr19': 'i', 'chr18': 'h', 'chrM': 'm', 'chr22': 'l', 'chr20': 'j',
'chr21': 'k', 'chr7': '7', 'chr6': '6', 'chr5': '5', 'chr4': '4',
'chr3': '3', 'chr2': '2', 'chr1': '1', 'chr9': '9', 'chr8': '8'}

def getFileLength(fh):
    wholefile = fh.read()
    numlines = wholefile.count("\n")
    fh.seek(0)
    return numlines

count = 0
print "reading affy file", afile
fh = open(afile)
n = getFileLength(fh)
chromio = cStringIO.StringIO()
coords = numpy.zeros(n, dtype=int)
points = numpy.zeros(n)

t0 = time.clock()
reader = csv.reader(fh, delimiter="\t")
for row in reader:
    if not row:
        continue
    chrom, coord, point = row
    mappedc = chrommap[chrom]
    chromio.write(mappedc)
    coords[count] = coord
    points[count] = point
    count += 1
t1 = time.clock()
print "finished: %f.2" % (t1 - t0)

$ ./affyspeedtest.py
reading affy file largefile.txt
finished: 15.54.2

Thanks again (tugs forelock),

Peter
Multiprocessing Pool and functions with many arguments
I'm trying to get to grips with the multiprocessing module, having only used ParallelPython before. Based on this example:

http://docs.python.org/library/multiprocessing.html#using-a-pool-of-workers

what happens if I want my "f" to take more than one argument? I want to have a list of tuples of arguments and have these correspond to the arguments of f, but it keeps complaining that I only have one argument (the tuple). Do I have to pass in a tuple and break it up inside f? I can't use multiple input lists, as I would with regular map.

Thanks,
Peter
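A sketch of the usual workaround on Python 2 (where Pool has no starmap): keep f with its natural signature and give the pool a small wrapper that unpacks the argument tuple. The function f and the argument list here are made up for illustration:

from multiprocessing import Pool

def f(name, exponent, offset):
    # the real worker function, with its natural multi-argument signature
    return name, 2 ** exponent + offset

def f_star(args):
    # Pool.map passes exactly one item per call, so unpack the tuple here.
    return f(*args)

if __name__ == "__main__":
    arglist = [("a", 2, 0), ("b", 3, 1), ("c", 4, 2)]
    pool = Pool(processes=4)
    print pool.map(f_star, arglist)   # [('a', 4), ('b', 9), ('c', 18)]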
Re: CSV performance
>
> rows = fh.read().split()
> coords = numpy.array(map(int, rows[1::3]), dtype=int)
> points = numpy.array(map(float, rows[2::3]), dtype=float)
> chromio.writelines(map(chrommap.__getitem__, rows[::3]))
>

My original version is about 15 seconds. This version is about 9. The chunks version posted by Scott is about 11 seconds with a chunk size of 16384. When integrated into the overall code, reading all 28 files, it improves the performance by about 30%.

Many thanks to everybody for their help,

Peter
Overlapping region resolution
This may be an algorithmic question, but I'm trying to code it in Python, so...

I have a list of regions, each an integer start/end pair with a float data point attached. There may be overlaps between the regions. I want to resolve this into an ordered list with no overlapping regions. My initial solution was to sort the list by the start point, and then compare each pair of adjacent regions, clipping any overlapping section in half. I've attached code at the bottom. Unfortunately, this does not work well if you have sections where three or more regions overlap.

A more general solution is to work out where all the overlaps are before I start. Then I can break up the region space based on which regions overlap each section, and take averages of all the data points that are present in a particular section. Devising an algorithm to do this is making my brain hurt. Any ideas?

Peter

# also validates the data
def clipRanges(regions):
    for i in range(len(regions) - 1):
        thispoint = regions[i]
        nextpoint = regions[i+1]
        assert thispoint[1] > thispoint[0] and nextpoint[1] > nextpoint[0], "point read not valid"
        thisend = thispoint[1]
        nextstart = nextpoint[0]
        diff = thisend - nextstart
        # a difference of zero is too close together
        if diff > -1:
            if diff % 2 == 1:
                diff += 1
            correction = diff / 2
            newend = thisend - correction
            newstart = newend + 1
            assert newend > thispoint[0] and nextpoint[1] > newstart, "new range not valid!"
            newthispoint = (thispoint[0], newend, thispoint[2])
            newnextpoint = (newstart, nextpoint[1], nextpoint[2])
            regions[i] = newthispoint
            regions[i+1] = newnextpoint
    return regions

regions = [ (0,10,2.5), (12,22,3.5), (15,25,1.2), (23, 30,0.01), (27, 37,1.23), (30, 35, 1.45) ]
regions2 = [ (0,10,2.5), (1,11,1.1), (2,12,1.2) ]

# works fine, produces [(0, 10, 2.5), (12, 18, 3.5), (19, 24, 1.2), (25, 28, 0.01), (29, 33, 1.23), (34, 35, 1.45)]
print clipRanges(regions)

# violates "new range not valid" assertion
print clipRanges(regions2)
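To make the question concrete, here is the sort of thing I have in mind for the more general approach - only a sketch, and the averaging and the inclusive-coordinate handling are guesses at what I actually want: collect every start and end as a boundary, walk the elementary intervals between consecutive boundaries, and average the data points of whichever regions cover each interval.

def flattenRegions(regions):
    """Split the covered space at every region boundary and average the
    data points of all regions covering each piece (sketch only)."""
    # Treat each region as covering the inclusive range [start, end].
    boundaries = sorted(set([r[0] for r in regions] + [r[1] + 1 for r in regions]))
    flattened = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Regions whose span contains the whole elementary interval [left, right-1].
        covering = [d for (s, e, d) in regions if s <= left and right - 1 <= e]
        if covering:
            flattened.append((left, right - 1, sum(covering) / len(covering)))
    return flattened

regions2 = [ (0,10,2.5), (1,11,1.1), (2,12,1.2) ]
# roughly [(0, 0, 2.5), (1, 1, 1.8), (2, 10, 1.6), (11, 11, 1.15), (12, 12, 1.2)]
print flattenRegions(regions2)

This is quadratic in the number of regions because of the inner list comprehension, so for the real data the covering lookup would presumably have to go back through an interval tree.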