Object cleanup

2012-05-30 Thread psaff...@googlemail.com
I am writing a screen scraping application using BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/

(which is fantastic, by the way).

I have an object that has two methods, each of which loads an HTML document and 
scrapes out some information, putting strings from the HTML documents into 
lists and dictionaries. I have a set of these objects from which I am 
aggregating and returning data. 

With a large number of these objects, the memory footprint is very large. The 
"soup" object is a local variable to each scraping method, so I assumed it 
would be cleaned up after the method had returned. However, I've found 
using guppy that, after the methods have returned, most of the memory is 
being taken up by BeautifulSoup objects of one type or another. I'm not 
declaring BeautifulSoup objects anywhere else.

I've tried assigning None to the "soup" variables at the end of the method 
calls and calling garbage collection manually, but this doesn't seem to help. 
I'd like to find out exactly what object "owns" the various BeautifulSoup 
structures, but I'm quite a new guppy user and I can't figure out how to do 
this.

How do I force the memory for these soup objects to be freed? Is there anything 
else I should be looking at to find out the cause of these problems?
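
For reference, this is the kind of poking around I've been attempting with 
the gc module to find the owners (the 'Tag' type name is just a guess at 
what is being kept alive):

import gc

# find any gc-tracked objects whose type name looks like a BeautifulSoup Tag
leftovers = [obj for obj in gc.get_objects()
             if type(obj).__name__ == 'Tag']

if leftovers:
    # list whatever still holds a reference to the first one
    # (note: the 'leftovers' list itself will show up here too)
    for referrer in gc.get_referrers(leftovers[0]):
        print type(referrer), repr(referrer)[:80]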

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Object cleanup

2012-05-31 Thread psaff...@googlemail.com
Thanks for all the responses.

It looks like none of the BeautifulSoup objects have __del__ methods, so I 
don't think that can be the problem.

To answer your other question, guppy was the best match I came up with when 
looking for a memory profiler for Python (or more specifically "Heapy"):

http://guppy-pe.sourceforge.net/#Heapy

On Thursday, May 31, 2012 2:51:52 AM UTC+1, Steven D'Aprano wrote:
> 
> The destructor doesn't get called until the last reference is gone.
> 

That makes sense, so now I need to track down why there are references to the 
object when I don't think there should be. Are there any systematic methods for 
doing this?

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: mod_python: delay in files changing after alteration

2009-01-12 Thread psaff...@googlemail.com
On 6 Jan, 23:31, Graham Dumpleton  wrote:

> Thus, any changes to modules/packages installed on sys.path require a
> full restart of Apache to ensure they are loaded by all Apache child
> worker processes.
>

That will be it. I'm pulling in some libraries of my own from
elsewhere, which are still being modified to accommodate the web app.
These are the changes that are causing the problems. An Apache restart
isn't too onerous - I'll just start doing that.

Thanks,

Peter
--
http://mail.python.org/mailman/listinfo/python-list


subprocess.Popen stalls

2009-01-12 Thread psaff...@googlemail.com
I'm building a bioinformatics application using the ipcress tool:

http://www.ebi.ac.uk/~guy/exonerate/ipcress.man.html

I'm using subprocess.Popen to execute ipcress, which takes a group of
files full of DNA sequences and returns some analysis on them. Here's
a code fragment:

from subprocess import Popen, PIPE

cmd = "/usr/bin/ipcress ipcresstmp.txt --sequence /home/pzs/genebuilds/human/*.fasta"
print "checking with ipcress using command", cmd
p = Popen(cmd, shell=True, bufsize=100, stdout=PIPE, stderr=PIPE)
retcode = p.wait()
if retcode != 0:
    print "ipcress failed with error code:", retcode
    raise Exception
output = p.stdout.read()

If I run the command at my shell, it finishes successfully. It takes
30 seconds - it uses 100% of one core and several hundred MB of memory
during this time. The output is 220KB of text.

However, running it through Python as per the above code, it stalls
after 5 seconds not using any processor at all. I've tried leaving it
for a few minutes with no change. If I interrupt it, it's at the
"retcode = p.wait()" line.

I've tried making the bufsize really large and that doesn't seem to
help. I'm a bit stuck - any suggestions? This same command has worked
fine on other ipcress runs. This one might generate more output than
the others, but 220KB isn't that much, is it?

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: subprocess.Popen stalls

2009-01-12 Thread psaff...@googlemail.com
On 12 Jan, 15:33, mk  wrote:
>
> Better use communicate() method:
>

Oh yes - it's right there in the documentation. That worked perfectly.
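
For the archives, the fix was essentially this (a minimal sketch; the
command is the same as in my earlier post):

from subprocess import Popen, PIPE

cmd = "/usr/bin/ipcress ipcresstmp.txt --sequence /home/pzs/genebuilds/human/*.fasta"
p = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
# communicate() drains both pipes before waiting, so the child can't
# block on a full pipe buffer the way it did with wait()
output, errors = p.communicate()
if p.returncode != 0:
    print "ipcress failed with error code:", p.returncode
    raise Exception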

Many thanks,

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Which core am I running on?

2009-02-09 Thread psaff...@googlemail.com
Is there some way I can get at this information at run-time? I'd like
to use it to tag diagnostic output dumped during runs using Parallel
Python.

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: Which core am I running on?

2009-02-09 Thread psaff...@googlemail.com
On 9 Feb, 12:24, Gerhard Häring  wrote:
> Looks like I have answered a similar question once, btw. ;-)
>

Ah, yes - thanks. I did Google for it, but obviously didn't have the
right search term.

Cheers,

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Too many open files

2009-02-09 Thread psaff...@googlemail.com
I'm building a pipeline involving a number of shell tools. In each
case, I create a temporary file using tempfile.mkstemp() and invoke a
command ("cmd < /tmp/tmpfile") on it using subprocess.Popen.

At the end of each section, I call close() on the file handles and use
os.remove() to delete them. Even so, I build up file descriptors that
eventually give me the "too many open files" error message. Am I missing
something here?
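
In case it matters, the pattern is roughly this (paths and the command are
simplified); my working assumption is that the raw descriptor returned by
mkstemp() needs its own os.close(), separate from any file object opened
on the path:

import os
import subprocess
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, "input data for the tool\n")
finally:
    # mkstemp() returns an OS-level descriptor; closing only a file
    # object opened separately on 'path' would leave this one open
    os.close(fd)

subprocess.Popen("wc -l < %s" % path, shell=True).wait()
os.remove(path)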

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: Which core am I running on?

2009-02-09 Thread psaff...@googlemail.com
On 9 Feb, 12:24, Gerhard Häring  wrote:
> http://objectmix.com/python/631346-parallel-python.html
>

Hmm. In fact, this doesn't seem to work for pp. When I run the code
below, it says everything is running on the one core.

import pp
import random
import time
from string import lowercase

ncpus = 3

def timedCharDump(waittime, char):
    time.sleep(waittime)
    mycore = open("/proc/%i/stat" % os.getpid()).read().split()[39]
    print "I'm doing stuff!", mycore, char
    return char

job_server = pp.Server(ncpus, ppservers=())

jobdetails = [ (random.random(), letter) for letter in lowercase ]

jobs = [ job_server.submit(timedCharDump, (jinput1, jinput2), (),
                           ("os", "time",)) for jinput1, jinput2 in jobdetails ]

for job in jobs:
    print job()


Peter
--
http://mail.python.org/mailman/listinfo/python-list


Selecting a different superclass

2008-12-17 Thread psaff...@googlemail.com
This might be a pure OO question, but I'm doing it in Python so I'll
ask here.

I'm writing a number crunching bioinformatics application. Read lots
of numbers from files; merge, median and munge; draw plots. I've found
that the most critical part of this work is validation and
traceability - "where does this final value come from? How has it been
combined with other values? Is that right?"

My current implementation stores all my values just as floats with a
class called PointSet for storing one set of such values, with various
mathematical and statistical methods. There are several subclasses of
PointSet (IDPointSet, MicroArrayPointSet) for obtaining values from
different file types and with different processing pipelines.

I'm planning to instead store each value in a TraceablePoint class
which has members that describe the processing stages this value has
undergone and a TraceablePointSet class to store groups of these -
this will contain all the same methods as PointSet, but will operate
on TraceablePoints instead of floats. Of course, this will be much
slower than just floats, so I'd like to be able to switch it on and
off.

The problem is that IDPointSet and MicroArrayPointSet will need to
inherit from PointSet or TraceablePointSet based on whether I'm
handling traceable points or not. Can I select a superclass
conditionally like this in Python? Am I trying to do something really
evil here?

Any other bright ideas on my application also welcome.

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: Selecting a different superclass

2008-12-18 Thread psaff...@googlemail.com
On 17 Dec, 20:33, "Chris Rebert"  wrote:

> superclass = TraceablePointSet if tracing else PointSet
>

Perfect - many thanks. Good to know I'm absolved from evil, also ;)
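
For anyone finding this in the archive, a minimal sketch of how that reads
in practice (the class names and the tracing flag are placeholders from my
earlier post):

tracing = True

class PointSet(object):
    pass

class TraceablePointSet(PointSet):
    pass

# choose the base class once, at class definition time
superclass = TraceablePointSet if tracing else PointSet

class IDPointSet(superclass):
    pass

class MicroArrayPointSet(superclass):
    pass

print IDPointSet.__mro__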

Peter
--
http://mail.python.org/mailman/listinfo/python-list


mod_python: delay in files changing after alteration

2009-01-05 Thread psaff...@googlemail.com
Maybe this is an apache question, in which case apologies.

I am running mod_python 3.3.1-3 on apache 2.2.9-7. It works fine, but
I find that when I alter a source file during development, it
sometimes takes 5 seconds or so for the changes to be seen. This might
sound trivial, but when debugging tens of silly errors, it's annoying
that I have to keep hitting refresh on my browser waiting for the
change to "take". I'm guessing this is just a caching issue of some
kind, but can't figure out how to switch it off. Any suggestions?

The entry in my apache2.conf looks like this:


   SetHandler mod_python
   PythonHandler mod_python.publisher
   PythonDebug On



Thanks,

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Memory efficient tuple storage

2009-03-13 Thread psaff...@googlemail.com
I'm reading in some rather large files (28 files each of 130MB). Each
file is a genome coordinate (chromosome (string) and position (int))
and a data point (float). I want to read these into a list of
coordinates (each a tuple of (chromosome, position)) and a list of
data points.

This has taught me that Python lists are not memory efficient: if I use
lists, memory usage climbs at about 100MB a second until it hits the swap
space, and this machine has 8GB of physical memory. I can use Python
arrays or numpy arrays for the data points, which is much more manageable.
However, I still need the coordinates. If I don't keep them in a list,
where can I keep them?

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: Memory efficient tuple storage

2009-03-13 Thread psaff...@googlemail.com
Thanks for all the replies.

First of all, can anybody recommend a good way to show memory usage? I
tried heapy, but couldn't make much sense of the output and it didn't
seem to change too much for different usages. Maybe I was just making
the h.heap() call in the wrong place. I also tried getrusage() in the
resource module. That seemed to give 0 for the shared and unshared
memory size no matter what I did. I was calling it after the function
call that filled up the lists. The memory figures I give in this
message come from top.
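
For reference, this is the getrusage() call I was making; on my box most
of the fields just come back as 0, though on platforms that do fill it in,
ru_maxrss should be the peak resident set size in kilobytes:

import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
# ru_ixrss / ru_idrss are the shared/unshared sizes that come back as 0 here
print "shared:", usage.ru_ixrss, "unshared:", usage.ru_idrss
print "peak RSS (kB, where supported):", usage.ru_maxrss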

The numpy solution does work, but it uses more than 1GB of memory for
one of my 130MB files. I'm using

np.dtype({'names': ['chromo', 'position', 'dpoint'],
          'formats': ['S6', 'i4', 'f8']})

so shouldn't it use 18 bytes per line? The file has 5832443 lines,
which by my arithmetic is around 100MB...?

My previous solution - using a python array for the numbers and a list
of tuples for the coordinates uses about 900MB. The dictionary
solution suggested by Tim got this down to 650MB. If I just ignore the
coordinates, this comes down to less than 100MB. I feel sure the list
mechanics for storing the coordinates is what is killing me here.

As to "work smarter", you could be right, but it's tricky. The 28
files are in 4 groups of 7, so given that each file is about 6 million
lines, each group of data points contains about 42 million points.
First, I need to divide every point by the median of its group. Then I
need to z-score the whole group of points.

After this preparation, I need to file each point, based on its
coordinates, into other data structures - the genome itself is divided
up into bins that cover a range of coordinates, and we file each point
into the appropriate bin for the coordinate region it overlaps. Then
there are operations that combine the values from various bins. The
relevant coordinates for these combinations come from more enormous
csv files. I've already done all this analysis on smaller datasets, so
I'm hoping I won't have to make huge changes just to fit the data into
memory. Yes, I'm also finding out how much it will cost to upgrade to
32GB of memory :)

Sorry for the long message...

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: Memory efficient tuple storage

2009-03-19 Thread psaff...@googlemail.com
In the end, I used a cStringIO object to store the chromosomes -
because there are only 23, I can use one character for each chromosome
and represent the whole lot with a giant string and a dictionary to
say what each character means. Then I used numpy arrays for the data
and coordinates. This squeezed each file into under 100MB.
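
A minimal sketch of the layout I ended up with (the mapping and the input
rows are cut down here; the real code reads them from the csv files):

import cStringIO
import numpy

# one character per chromosome, decoded via this (cut-down) mapping
chrommap = {'chr1': '1', 'chr2': '2', 'chrX': 'x'}

rows = [('chr1', 3754914, 1.19828), ('chr2', 3754950, 1.56557)]

chromio = cStringIO.StringIO()
coords = numpy.zeros(len(rows), dtype=int)
points = numpy.zeros(len(rows))

for i, (chrom, coord, point) in enumerate(rows):
    chromio.write(chrommap[chrom])   # one byte per row instead of a string object
    coords[i] = coord
    points[i] = point

chroms = chromio.getvalue()          # the "giant string", one character per point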

Thanks again for the help!

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Parallel processing on shared data structures

2009-03-19 Thread psaff...@googlemail.com
I'm filing 160 million data points into a set of bins based on their
position. At the moment, this takes just over an hour using interval
trees. I would like to parallelise this to take advantage of my quad
core machine. I have some experience of Parallel Python, but PP seems
to only really work for problems where you can do one discrete bit of
processing and recombine these results at the end.

I guess I could thread my code and use mutexes to protect the shared
lists that everybody is filing into. However, my understanding is that
the threads would all run inside a single Python process under the GIL,
so this won't give me multi-core.

Does anybody have any suggestions for this?
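
For concreteness, the split-then-merge shape I keep coming back to would
look roughly like this with the stdlib multiprocessing module (the
"binning" here is faked with integer division, just to show the shape):

import multiprocessing

NBINS = 10

def bin_chunk(points):
    # each worker bins its own chunk into private lists - no shared state
    bins = [[] for _ in range(NBINS)]
    for position, value in points:
        bins[position % NBINS].append(value)
    return bins

if __name__ == "__main__":
    points = [(i, float(i)) for i in range(100000)]
    nworkers = 4
    chunksize = len(points) // nworkers + 1
    chunks = [points[i:i + chunksize] for i in range(0, len(points), chunksize)]

    pool = multiprocessing.Pool(nworkers)
    partial_bins = pool.map(bin_chunk, chunks)

    # merge the per-worker bins back into one set of bins
    merged = [[] for _ in range(NBINS)]
    for bins in partial_bins:
        for i, b in enumerate(bins):
            merged[i].extend(b)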

Peter
--
http://mail.python.org/mailman/listinfo/python-list


mod_python form upload: permission denied sometimes...

2009-04-24 Thread psaff...@googlemail.com
I have a mod_python application that takes a POST file upload from a
form. It works fine from my machine, other machines in my office and
my home machine. It does not work from my boss's machine in a
different city - he gets "You don't have permission to access this on
this server".

In the logs, it's returned 403. I also have this error in error.log:

Cannot traverse upload in /pythonapps/wiggle/form/upload because
 is not a traversable object,
referer: ...

Could this be a network-level problem? If so, why does it work from my
home machine but not my boss's machine? The file to upload is quite
large - 7MB.
--
http://mail.python.org/mailman/listinfo/python-list


CSV performance

2009-04-27 Thread psaff...@googlemail.com
I'm using the CSV library to process a large amount of data - 28
files, each of 130MB. Just reading in the data from one file and
filing it into very simple data structures (numpy arrays and a
cStringIO) takes around 10 seconds. If I just slurp one file into a
string, it only takes about a second, so I/O is not the bottleneck. Is
it really taking 9 seconds just to split the lines and set the
variables?

Is there some way I can improve the CSV performance? Is there a way I
can slurp the file into memory and read it like a file from there?
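
On the second question, the kind of thing I have in mind is just wrapping
the slurped string in a file-like object (sketch only; largefile.txt is
one of my 130MB files):

import csv
import cStringIO

data = open("largefile.txt", "rb").read()   # slurp the whole file
buf = cStringIO.StringIO(data)              # treat the string as a file
reader = csv.reader(buf, delimiter="\t")
for row in reader:
    x, y, z = row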

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: CSV performance

2009-04-27 Thread psaff...@googlemail.com
Thanks for your replies. Many apologies for not including the right
information first time around. More information is below.

I have tried running it just on the csv read:

import time
import csv

afile = "largefile.txt"

t0 = time.clock()

print "working at file", afile
reader = csv.reader(open(afile, "r"), delimiter="\t")
for row in reader:
    x, y, z = row


t1 = time.clock()

print "finished: %.2f" % (t1 - t0)


$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.86


A tiny bit of background on the final application: this is biological
data from an affymetrix platform. The csv files are a chromosome name,
a coordinate and a data point, like this:

chr1    3754914    1.19828
chr1    3754950    1.56557
chr1    3754982    1.52371

In the "simple data structures" cod below, I do some jiggery pokery
with the chromosome names to save me storing the same string millions
of times.


import csv
import cStringIO
import numpy
import time

afile = "largefile.txt"

chrommap = {'chrY': 'y', 'chrX': 'x', 'chr13': 'c',
'chr12': 'b', 'chr11': 'a', 'chr10': '0',
'chr17': 'g', 'chr16': 'f', 'chr15': 'e',
'chr14': 'd', 'chr19': 'i', 'chr18': 'h',
'chrM': 'm', 'chr22': 'l', 'chr20': 'j',
'chr21': 'k', 'chr7': '7', 'chr6': '6',
'chr5': '5', 'chr4': '4', 'chr3': '3',
'chr2': '2', 'chr1': '1', 'chr9': '9', 'chr8': '8'}


def getFileLength(fh):
    wholefile = fh.read()
    numlines = wholefile.count("\n")
    fh.seek(0)
    return numlines

count = 0
print "reading affy file", afile
fh = open(afile)
n = getFileLength(fh)
chromio = cStringIO.StringIO()
coords = numpy.zeros(n, dtype=int)
points = numpy.zeros(n)

t0 = time.clock()
reader = csv.reader(fh, delimiter="\t")
for row in reader:
    if not row:
        continue
    chrom, coord, point = row
    mappedc = chrommap[chrom]
    chromio.write(mappedc)
    coords[count] = coord
    points[count] = point
    count += 1
t1 = time.clock()

print "finished: %.2f" % (t1 - t0)


$ ./affyspeedtest.py
reading affy file largefile.txt
finished: 15.54


Thanks again (tugs forelock),

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Multiprocessing Pool and functions with many arguments

2009-04-29 Thread psaff...@googlemail.com
I'm trying to get to grips with the multiprocessing module, having
only used ParallelPython before.

Based on this example:

http://docs.python.org/library/multiprocessing.html#using-a-pool-of-workers

What happens if I want my "f" to take more than one argument? I want
to have a list of tuples of arguments and have these correspond to the
arguments in f, but it keeps complaining that I only have one argument
(the tuple). Do I have to pass in a tuple and break it up inside f? I
can't use multiple input lists, as I would with regular map.
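
To make it concrete, the workaround I'm currently leaning towards is a
small wrapper that unpacks the tuple (sketch only, with a made-up f):

from multiprocessing import Pool

def f(x, y):
    return x * y

def f_star(args):
    # Pool.map hands over one item per call, so unpack the tuple here
    return f(*args)

if __name__ == "__main__":
    pool = Pool(processes=4)
    arglist = [(1, 2), (3, 4), (5, 6)]
    print pool.map(f_star, arglist)   # [2, 12, 30]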

Thanks,

Peter
--
http://mail.python.org/mailman/listinfo/python-list


Re: CSV performance

2009-04-29 Thread psaff...@googlemail.com
>
> rows = fh.read().split()
> coords = numpy.array(map(int, rows[1::3]), dtype=int)
> points = numpy.array(map(float, rows[2::3]), dtype=float)
> chromio.writelines(map(chrommap.__getitem__, rows[::3]))
>

My original version is about 15 seconds. This version is about 9. The
chunks version posted by Scott is about 11 seconds with a chunk size
of 16384.

When integrated into the overall code, reading all 28 files, it
improves the performance by about 30%.

Many thanks to everybody for their help,

Peter

--
http://mail.python.org/mailman/listinfo/python-list


Overlapping region resolution

2009-05-21 Thread psaff...@googlemail.com
This may be an algorithmic question, but I'm trying to code it in
Python, so...

I have a list of pairwise regions, each with an integer start and end
and a float data point. There may be overlaps between the regions. I
want to resolve this into an ordered list with no overlapping
regions.

My initial solution was to sort the list by the start point, and then
compare each adjacent region, clipping any overlapping section in
half. I've attached code at the bottom. Unfortunately, this does not
work well if you have sections that have three or more overlapping
regions.

A more general solution is to work out where all the overlaps are
before I start. Then I can break up the region space based on what
regions overlap each section and take averages of all the datapoints
that are present in a particular section. Devising an algorithm to do
this is making my brain hurt. Any ideas?

Peter


# also validates the data
def clipRanges(regions):
    for i in range(len(regions) - 1):
        thispoint = regions[i]
        nextpoint = regions[i+1]
        assert thispoint[1] > thispoint[0] and nextpoint[1] > nextpoint[0], "point read not valid"
        thisend = thispoint[1]
        nextstart = nextpoint[0]
        diff = thisend - nextstart
        # a difference of zero is too close together
        if diff > -1:
            if diff % 2 == 1:
                diff += 1
            correction = diff / 2
            newend = thisend - correction
            newstart = newend + 1
            assert newend > thispoint[0] and nextpoint[1] > newstart, "new range not valid!"
            newthispoint = (thispoint[0], newend, thispoint[2])
            newnextpoint = (newstart, nextpoint[1], nextpoint[2])
            regions[i] = newthispoint
            regions[i+1] = newnextpoint
    return regions

regions = [ (0,10,2.5), (12,22,3.5), (15,25,1.2), (23,30,0.01), (27,37,1.23), (30,35,1.45) ]
regions2 = [ (0,10,2.5), (1,11,1.1), (2,12,1.2) ]

# works fine, produces [(0, 10, 2.5), (12, 18, 3.5), (19, 24, 1.2),
# (25, 28, 0.01), (29, 33, 1.23), (34, 35, 1.45)]
print clipRanges(regions)
# violates "new range not valid" assertion
print clipRanges(regions2)
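
For the general case, the boundary-sweep idea I'm toying with would look
something like this (a rough sketch, untested on my real data; the
endpoint handling is sloppy, since adjacent intervals share a coordinate):
collect every start and end coordinate, then for each elementary interval
average the datapoints of all the regions that cover it.

def flattenRegions(regions):
    # every start and end coordinate becomes a boundary
    boundaries = sorted(set([r[0] for r in regions] + [r[1] for r in regions]))
    flattened = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        # datapoints of all regions covering this elementary interval
        covering = [d for (s, e, d) in regions if s <= start and end <= e]
        if covering:
            flattened.append((start, end, sum(covering) / len(covering)))
    return flattened

# the case that broke clipRanges above now just averages the overlaps
print flattenRegions([ (0,10,2.5), (1,11,1.1), (2,12,1.2) ])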
-- 
http://mail.python.org/mailman/listinfo/python-list