(question) How to use python get access to google search without query quota limit

2006-05-05 Thread Per
I am doing a Natural Language Processing project for academic use.

I think Google's rich retrieval information and query segmentation might be
of help. I downloaded the Google API, but there is a query limit (1,000/day).
How can I write Python code to simulate browser-like activity and submit more
than 10k queries in one day?

Applying for more than 10 licence keys and switching between them whenever a
query-quota exception is raised is not a neat idea...

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: (question) How to use python get access to google search without query quota limit

2006-05-05 Thread Per
Yeah, Thanks Am,

I can be considered an advanced Google user, presumably, but I am not an
advanced programmer yet.

If everyone could generate an unlimited number of queries, the user-query
data, which I believe is Google's biggest advantage, would soon be in chaos.
Could they simply ignore some queries from a certain licence key, so that
they keep their user-query statistics normal and still give heavy queriers a
reasonable response?

-- 
http://mail.python.org/mailman/listinfo/python-list


setting PYTHONPATH to override system wide site-packages

2009-02-28 Thread per
hi all,

i recently installed a new version of a package using python setup.py
install --prefix=/my/homedir on a system where i don't have root
access. the old package still resides in /usr/lib/python2.5/site-
packages/ and i cannot erase it.

i set my python path as follows in ~/.cshrc

setenv PYTHONPATH /path/to/newpackage

but whenever i go to python and import the module, the version in site-
packages is loaded. how can i override this setting and make it so
python loads the version of the package that's in my home dir?

 thanks.


--
http://mail.python.org/mailman/listinfo/python-list


Re: setting PYTHONPATH to override system wide site-packages

2009-02-28 Thread per
On Feb 28, 11:24 pm, Carl Banks  wrote:
> On Feb 28, 7:30 pm, per  wrote:
>
> > hi all,
>
> > i recently installed a new version of a package using python setup.py
> > install --prefix=/my/homedir on a system where i don't have root
> > access. the old package still resides in /usr/lib/python2.5/site-
> > packages/ and i cannot erase it.
>
> > i set my python path as follows in ~/.cshrc
>
> > setenv PYTHONPATH /path/to/newpackage
>
> > but whenever i go to python and import the module, the version in site-
> > packages is loaded. how can i override this setting and make it so
> > python loads the version of the package that's in my home dir?
>
> What happens when you run the command "print sys.path" from the Python
> prompt?  /path/to/newpackage should be the second item, and should be
> listed in front of the site-packages dir.
>
> What happens when you run "print os.eviron['PYTHONPATH']" at the
> Python interpreter?  It's possible that the sysadmin installed a
> script that removes PYTHONPATH environment variable before invoking
> Python.  What happens when you type "which python" at the csh prompt?
>
> What happens when you type "ls /path/to/newpackage" at your csh
> prompt?  Is the module you're trying to import there?
>
> Your approach should work.  These are just suggestions on how to
> diagnose the problem; we can't really help you figure out what's wrong
> without more information.
>
> Carl Banks

hi,

i am setting it programmatically now, using:

import sys
sys.path.insert(1, '/path/to/newpackage')

sys.path now looks exactly like what it looked like before, except the
second element is my directory. yet when i do

import mymodule
print mymodule.__version__

i still get the old version...

any other ideas?
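
One way to narrow it down (a minimal diagnostic sketch; mymodule and the
path are just the names used in this thread):

import sys, os
print(sys.executable)                    # which interpreter is actually running
print(os.environ.get('PYTHONPATH'))      # is the variable visible to it?
print(sys.path[:3])                      # is your directory really near the front?
import mymodule
print(mymodule.__file__)                 # which copy of the package got imported

If __file__ still points at site-packages, something (a wrapper script, a
.pth file, or an egg) could be putting the system copy ahead of yours.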
--
http://mail.python.org/mailman/listinfo/python-list


Re: setting PYTHONPATH to override system wide site-packages

2009-02-28 Thread per
On Feb 28, 11:53 pm, per  wrote:
> On Feb 28, 11:24 pm, Carl Banks  wrote:
>
>
>
> > On Feb 28, 7:30 pm, per  wrote:
>
> > > hi all,
>
> > > i recently installed a new version of a package using python setup.py
> > > install --prefix=/my/homedir on a system where i don't have root
> > > access. the old package still resides in /usr/lib/python2.5/site-
> > > packages/ and i cannot erase it.
>
> > > i set my python path as follows in ~/.cshrc
>
> > > setenv PYTHONPATH /path/to/newpackage
>
> > > but whenever i go to python and import the module, the version in site-
> > > packages is loaded. how can i override this setting and make it so
> > > python loads the version of the package that's in my home dir?
>
> > What happens when you run the command "print sys.path" from the Python
> > prompt?  /path/to/newpackage should be the second item, and should be
> > listed in front of the site-packages dir.
>
> > What happens when you run "print os.eviron['PYTHONPATH']" at the
> > Python interpreter?  It's possible that the sysadmin installed a
> > script that removes PYTHONPATH environment variable before invoking
> > Python.  What happens when you type "which python" at the csh prompt?
>
> > What happens when you type "ls /path/to/newpackage" at your csh
> > prompt?  Is the module you're trying to import there?
>
> > Your approach should work.  These are just suggestions on how to
> > diagnose the problem; we can't really help you figure out what's wrong
> > without more information.
>
> > Carl Banks
>
> hi,
>
> i am setting it programmatically now, using:
>
> import sys
> sys.path.insert(1, '/path/to/newpackage')
>
> sys.path now looks exactly like what it looked like before, except the
> second element is my directory. yet when i do
>
> import mymodule
> print mymodule.__version__
>
> i still get the old version...
>
> any other ideas?

in case it helps, it gives me this warning when i try to import the
module

/usr/lib64/python2.5/site-packages/pytz/__init__.py:29: UserWarning:
Module dateutil was already imported from /usr/lib64/python2.5/site-
packages/dateutil/__init__.pyc, but /usr/lib/python2.5/site-packages
is being added to sys.path
  from pkg_resources import resource_stream
--
http://mail.python.org/mailman/listinfo/python-list


speeding up reading files (possibly with cython)

2009-03-07 Thread per
hi all,

i have a program that essentially loops through a text file that's
about 800 MB in size containing tab-separated data... my program
parses this file and stores its fields in a dictionary of lists.

for line in file:
  split_values = line.strip().split('\t')
  # do stuff with split_values

currently, this is very slow in python, even if all i do is break up
each line using split() and store its values in a dictionary, indexing
by one of the tab separated values in the file.

is this just an overhead of python that's inevitable? do you guys
think that switching to cython might speed this up, perhaps by
optimizing the main for loop?  or is this not a viable option?

thank you.
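
for what it's worth, a sketch of the same pass using the csv module instead
of split() (the filename and the choice of key column are assumptions;
whether this is actually faster depends on the data):

import csv
from collections import defaultdict

data = defaultdict(list)
with open('data.txt') as f:                   # hypothetical filename
    for fields in csv.reader(f, delimiter='\t'):
        data[fields[0]].append(fields[1:])    # index by the first column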
--
http://mail.python.org/mailman/listinfo/python-list


parsing tab separated data efficiently into numpy/pylab arrays

2009-03-13 Thread per
hi all,

what's the most efficient / preferred python way of parsing tab
separated data into arrays? for example if i have a file containing
two columns one corresponding to names the other numbers:

col1\t col 2
joe\t  12.3
jane   \t 155.0

i'd like to parse into an array() such that i can do: mydata[:, 0] and
mydata[:, 1] to easily access all the columns.

right now i can iterate through the file, parse it manually using the
split('\t') command and construct a list out of it, then convert it to
arrays. but there must be a better way?

also, my first column is just a name, and so it is variable in length --
is there still a way to store it as an array so i can access mydata[:, 0]
to get all the names (as a list)?

thank you.
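
a minimal sketch with numpy.genfromtxt (assuming a reasonably recent numpy,
a file with no header row, and made-up column names):

import numpy as np

# dtype=None lets genfromtxt guess a type per column, so the result is a
# structured array with one named field per column
data = np.genfromtxt('data.txt', delimiter='\t', dtype=None,
                     names=('name', 'value'))
print(data['name'])     # all the names
print(data['value'])    # all the numbers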
--
http://mail.python.org/mailman/listinfo/python-list


loading program's global variables in ipython

2009-03-22 Thread per
hi all,

i have a file that declares some global variables, e.g.

myglobal1 = 'string'
myglobal2 = 5

and then some functions. i run it using ipython as follows:

[1] %run myfile.py

i notice then that myglobal1 and myglobal2 are not imported into
python's interactive namespace. i'd like them to be -- how can i do
this?

 (note my file does not contain a __name__ == '__main__' clause.)

thanks.
--
http://mail.python.org/mailman/listinfo/python-list


splitting a large dictionary into smaller ones

2009-03-22 Thread per
hi all,

i have a very large dictionary object that is built from a text file
that is about 800 MB -- it contains several million keys.  ideally i
would like to pickle this object so that i wouldn't have to parse this
large file to compute the dictionary every time i run my program.
however currently the pickled file is over 300 MB and takes a very
long time to write to disk - even longer than recomputing the
dictionary from scratch.

i would like to split the dictionary into smaller ones, containing
only hundreds of thousands of keys, and then try to pickle them. is
there a way to easily do this? i.e. is there an easy way to make a
wrapper for this such that i can access this dictionary as just one
object, but underneath it's split into several? so that i can write
my_dict[k] and get a value, or set my_dict[m] to some value without
knowing which sub dictionary it's in.

if there aren't known ways to do this, i would greatly appreciate any
advice/examples on how to write this data structure from scratch,
reusing as much of the dict() class as possible.

thanks.

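in case it helps, a minimal sketch of a dict-like wrapper that shards keys
across several plain dicts by hash (the names are illustrative; each shard
could then be pickled to its own file):

class ShardedDict(object):
    # dict-like object that spreads keys over several smaller dicts
    def __init__(self, num_shards=16):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, key):
        # pick the sub-dictionary responsible for this key
        return self.shards[hash(key) % len(self.shards)]

    def __getitem__(self, key):
        return self._shard(key)[key]

    def __setitem__(self, key, value):
        self._shard(key)[key] = value

    def __contains__(self, key):
        return key in self._shard(key)

    def __len__(self):
        return sum(len(shard) for shard in self.shards)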
--
http://mail.python.org/mailman/listinfo/python-list


Re: splitting a large dictionary into smaller ones

2009-03-22 Thread per
On Mar 22, 10:51 pm, Paul Rubin <http://phr...@nospam.invalid> wrote:
> per  writes:
> > i would like to split the dictionary into smaller ones, containing
> > only hundreds of thousands of keys, and then try to pickle them.
>
> That already sounds like the wrong approach.  You want a database.

fair enough - what native python database would you recommend? i
prefer not to install anything commercial or anything other than
python modules
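
for what it's worth, the standard library's shelve module is one
zero-install option: it behaves like a persistent, dict-like store backed
by a dbm file, so nothing has to be pickled in one giant write (keys must
be strings). a minimal sketch, with a made-up filename:

import shelve

db = shelve.open('mydata.db')       # hypothetical filename
db['some key'] = [1, 2, 3]          # each value is pickled individually
print(db['some key'])
db.close()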
--
http://mail.python.org/mailman/listinfo/python-list


generating random tuples in python

2009-04-20 Thread per
hi all,

i am generating a list of random tuples of numbers between 0 and 1
using the rand() function, as follows:

for i in range(0, n):
  rand_tuple = (rand(), rand(), rand())
  mylist.append(rand_tuple)

when i generate this list, some of the random tuples might be
very close to each other, numerically. for example, i might get:

(0.553, 0.542, 0.654)

and

(0.581, 0.491, 0.634)

so the two tuples are close to each other in that all of their numbers
have similar magnitudes.

how can i maximize the amount of "numeric distance" between the elements
of this list, but still make sure that all the tuples have numbers between
0 and 1 (inclusive)?

in other words i want the numbers within each tuple to be arbitrary (which
is why i am using rand()) but each tuple to be as different from the other
tuples in the list as possible.

thank you for your help

--
http://mail.python.org/mailman/listinfo/python-list


Re: generating random tuples in python

2009-04-20 Thread per
On Apr 20, 11:08 pm, Steven D'Aprano
 wrote:
> On Mon, 20 Apr 2009 11:39:35 -0700, per wrote:
> > hi all,
>
> > i am generating a list of random tuples of numbers between 0 and 1 using
> > the rand() function, as follows:
>
> > for i in range(0, n):
> >   rand_tuple = (rand(), rand(), rand())
> >   mylist.append(rand_tuple)
>
> > when i generate this list, some of the random tuples might be very close
> > to each other, numerically. for example, i might get:
> [...]
> > how can i maximize the amount of "numeric distance" between the elements
> > of
> > this list, but still make sure that all the tuples have numbers strictly
> > between 0 and 1 (inclusive)?
>
> Well, the only way to *maximise* the distance between the elements is to
> set them to (0.0, 0.5, 1.0).
>
> > in other words i want the list of random numbers to be arbitrarily
> > different (which is why i am using rand()) but as different from other
> > tuples in the list as possible.
>
> That means that the numbers you are generating will no longer be
> uniformly distributed, they will be biased. That's okay, but you need to
> describe *how* you want them biased. What precisely do you mean by
> "maximizing the distance"?
>
> For example, here's one strategy: you need three random numbers, so
> divide the complete range 0-1 into three: generate three random numbers
> between 0 and 1/3.0, called x, y, z, and return [x, 1/3.0 + y, 2/3.0 + z].
>
> You might even decide to shuffle the list before returning them.
>
> But note that you might still happen to get (say) [0.332, 0.334, 0.668]
> or similar. That's the thing with randomness.
>
> --
> Steven

i realize my example in the original post was misleading. i don't want
to maximize the difference between individual members of a single
tuple -- i want to maximize the difference between distinct tuples. in
other words, it's ok to have (.332, .334, .38), as long as the other
tuple is, say, (.52, .6, .9), which is very different from
(.332, .334, .38).  i want the members of a given tuple to be arbitrary,
e.g. something like (rand(), rand(), rand()), but the tuples should be
very different from each other.

to be more formal about "very different": i would be happy if they were
maximally distant in ordinary euclidean space... so if you just plot
the 3-tuples on x, y, z i want them all to be very different from each
other.  i realize this is obviously biased and that the tuples are not
uniformly distributed -- that's exactly what i want...

any ideas on how to go about this?

thank you.
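
one greedy heuristic (a sketch, not the only option): for each new tuple,
draw a handful of random candidates and keep the one whose minimum squared
euclidean distance to the tuples chosen so far is largest. everything stays
in [0, 1) because it all comes from random.random():

import random

def spread_tuples(n, candidates=50):
    def rand3():
        return (random.random(), random.random(), random.random())
    def dist2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))
    points = [rand3()]
    while len(points) < n:
        # keep the candidate that is farthest (in min-distance terms)
        # from everything chosen so far
        best = max((rand3() for _ in range(candidates)),
                   key=lambda c: min(dist2(c, p) for p in points))
        points.append(best)
    return points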
--
http://mail.python.org/mailman/listinfo/python-list


Is there such an idiom?

2006-03-19 Thread Per
http://jaynes.colorado.edu/PythonIdioms.html

"""Use dictionaries for searching, not lists. To find items in common
between two lists, make the first into a dictionary and then look for
items in the second in it. Searching a list for an item is linear-time,
while searching a dict for an item is constant time. This can often let
you reduce search time from quadratic to linear."""

Is this correct?
s = [1,2,3,4,5,...]
t = [4,5,6,7,8,...]
How do I find whether there are common item(s) between two lists in
linear time?
How do I find the number of common items between two lists in linear time?
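
For what it's worth, a minimal sketch of the dict/set version of that idiom:

s = [1, 2, 3, 4, 5]
t = [4, 5, 6, 7, 8]

lookup = set(s)                          # built once, roughly linear in len(s)
common = [x for x in t if x in lookup]   # each membership test is ~constant time
print(len(common) > 0)                   # is there any common item?
print(len(common))                       # how many common items?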

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Is there such an idiom?

2006-03-19 Thread Per
Thanks Ron,
 surely set is the simplest way to understand the question, to see
whether there is a non-empty intersection. But I did the following
thing in a silly way, still not sure whether it is going to be linear
time.
def foo():
    l = [...]
    s = [...]
    dic = {}
    for i in l:
        dic[i] = 0
    k = 0
    while k < len(s):
        ...
--
http://mail.python.org/mailman/listinfo/python-list


creating pipelines in python

2009-11-22 Thread per
hi all,

i am looking for a python package to make it easier to create a
"pipeline" of scripts (all in python). what i do right now is have a
set of scripts that produce certain files as output, and i simply have
a "master" script that checks at each stage whether the output of the
previous script exists, using functions from the os module. this has
several flaws and i am sure someone has thought of nice abstractions
for making these kind of wrappers easier to write.

does anyone have any recommendations for python packages that can do
this?

thanks.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: creating pipelines in python

2009-11-25 Thread per
Thanks to all for your replies.  i want to clarify what i mean by a
pipeline.  a major feature i am looking for is the ability to chain
functions or scripts together, where the output of one script -- which
is usually a file -- is required for another script to run.  so one
script has to wait for the other.  i would like to do this over a
cluster, where some of the scripts are distributed as separate jobs on
a cluster but the results are then collected together.  so the ideal
library would have easily facilities for expressing this things:
script X and Y run independently, but script Z depends on the output
of X and Y (which is such and such file or file flag).

is there a way to do this? i prefer not to use a framework that
requires control of the cluster, etc., like Disco, but something that's
lightweight and simple. right now ruffus seems most relevant but i am
not sure -- are there other candidates?

thank you.

On Nov 23, 4:02 am, Paul Rudin  wrote:
> per  writes:
> > hi all,
>
> > i am looking for a python package to make it easier to create a
> > "pipeline" of scripts (all in python). what i do right now is have a
> > set of scripts that produce certain files as output, and i simply have
> > a "master" script that checks at each stage whether the output of the
> > previous script exists, using functions from the os module. this has
> > several flaws and i am sure someone has thought of nice abstractions
> > for making these kind of wrappers easier to write.
>
> > does anyone have any recommendations for python packages that can do
> > this?
>
> Not entirely what you're looking for, but the subprocess module is
> easier to work with for this sort of thing than os. See e.g. 
> <http://docs.python.org/library/subprocess.html#replacing-shell-pipeline>
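
In the meantime, a minimal hand-rolled sketch of the file-dependency idea
(the script names and output files are made up; a tool like ruffus would add
the bookkeeping and cluster submission on top of this):

import os
import subprocess

def run_step(cmd, outputs):
    # run a step only if one of its declared output files is missing
    if not all(os.path.exists(f) for f in outputs):
        subprocess.check_call(cmd)

# X and Y are independent; Z needs the outputs of both
run_step(['python', 'script_x.py'], ['x.out'])
run_step(['python', 'script_y.py'], ['y.out'])
if os.path.exists('x.out') and os.path.exists('y.out'):
    run_step(['python', 'script_z.py'], ['z.out'])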

-- 
http://mail.python.org/mailman/listinfo/python-list


fastest native python database?

2009-06-17 Thread per
hi all,

i'm looking for a native python package to run a very simple
database. i was originally using cPickle with dictionaries for my problem,
but i was making dictionaries out of very large text files (around
1000MB in size) and pickling was simply too slow.

i am not looking for fancy SQL operations, just very simple database
operations (doesn't have to be SQL style), and my preference is for a
module that just needs python and doesn't require me to run a separate
database like Sybase or MySQL.

does anyone have any recommendations? the only candidates i've seen
are snaklesql and buzhug... any thoughts/benchmarks on these?

any info on this would be greatly appreciated. thank you
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: fastest native python database?

2009-06-17 Thread per
i would like to add to my previous post that if an option like SQLite
with a python interface (pysqlite) is orders of magnitude faster
than the naive python options, i'd prefer that. but if that's not the
case, a pure python solution without dependencies on other things
would be the best option.

thanks for the suggestion, will look into gadfly in the meantime.

On Jun 17, 11:38 pm, Emile van Sebille  wrote:
> On 6/17/2009 8:28 PM per said...
>
> > hi all,
>
> > i'm looking for a native python package to run a very simple data
> > base. i was originally using cpickle with dictionaries for my problem,
> > but i was making dictionaries out of very large text files (around
> > 1000MB in size) and pickling was simply too slow.
>
> > i am not looking for fancy SQL operations, just very simple data base
> > operations (doesn't have to be SQL style) and my preference is for a
> > module that just needs python and doesn't require me to run a separate
> > data base like Sybase or MySQL.
>
> You might like gadfly...
>
> http://gadfly.sourceforge.net/gadfly.html
>
> Emile
>
>
>
> > does anyone have any recommendations? the only candidates i've seen
> > are snaklesql and buzhug... any thoughts/benchmarks on these?
>
> > any info on this would be greatly appreciated. thank you
>
>
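
For reference, sqlite3 has shipped with Python since 2.5, so it needs
nothing beyond the standard library; a minimal sketch (the table layout and
filename are made up):

import sqlite3

conn = sqlite3.connect('mydata.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
conn.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)',
                 [('joe', '12.3'), ('jane', '155.0')])
conn.commit()
print(conn.execute('SELECT value FROM kv WHERE key = ?', ('joe',)).fetchone()[0])
conn.close()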

-- 
http://mail.python.org/mailman/listinfo/python-list


allowing output of code that is unittested?

2009-07-15 Thread per
hi all,

i am using the standard unittest module to unit test my code. my code
contains several print statements which i noticed are suppressed when i
call my unit tests using:

if __name__ == '__main__':
suite = unittest.TestLoader().loadTestsFromTestCase(TestMyCode)
unittest.TextTestRunner(verbosity=2).run(suite)

is there a way to allow all the print statements in the code that is
being run by the unit test functions to be printed to stdout?  i want
to be able to see the output of the tested code, in addition to the
output of the unit testing framework.

thank you.
-- 
http://mail.python.org/mailman/listinfo/python-list


efficiently splitting up strings based on substrings

2009-09-05 Thread per
I'm trying to efficiently "split" strings based on what substrings
they are made up of.
i have a set of strings that are comprised of known substrings.
For example, a, b, and c are substrings that are not identical to each
other, e.g.:
a = "0" * 5
b = "1" * 5
c = "2" * 5

Then my_string might be:

my_string = a + b + c

i am looking for an efficient way to solve the following problem.
suppose i have a short string x that is a substring of my_string.  I want
to "split" the string x into blocks based on which substrings (i.e. a, b,
or c) the chunks of x fall into.

to illustrate this, suppose x = "00111". Then I can detect where x starts
in my_string using my_string.find(x).  But I don't know how to partition x
into blocks depending on the substrings.  What I want to get out in this
case is: "00", "111".  If x were "00122", I'd want to get out "00", "1",
"22".

is there an easy way to do this?  i can't simply split x on a, b, or c
because these might not be contained in x.  I want to avoid doing something
inefficient like looking at all substrings of my_string, etc.

i wouldn't mind using regular expressions for this but i cannot think of
an easy regular expression for this problem.  I looked at the string module
in the library but did not see anything that seemed related, but i might
have missed it.

any help on this would be greatly appreciated.  thanks.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: efficiently splitting up strings based on substrings

2009-09-05 Thread per
On Sep 5, 6:42 pm, "Rhodri James"  wrote:
> On Sat, 05 Sep 2009 22:54:41 +0100, per  wrote:
> > I'm trying to efficiently "split" strings based on what substrings
> > they are made up of.
> > i have a set of strings that are comprised of known substrings.
> > For example, a, b, and c are substrings that are not identical to each
> > other, e.g.:
> > a = "0" * 5
> > b = "1" * 5
> > c = "2" * 5
>
> > Then my_string might be:
>
> > my_string = a + b + c
>
> > i am looking for an efficient way to solve the following problem.
> > suppose i have a short
> > string x that is a substring of my_string.  I want to "split" the
> > string x into blocks based on
> > what substrings (i.e. a, b, or c) chunks of s fall into.
>
> > to illustrate this, suppose x = "00111". Then I can detect where x
> > starts in my_string
> > using my_string.find(x).  But I don't know how to partition x into
> > blocks depending
> > on the substrings.  What I want to get out in this case is: "00",
> > "111".  If x were "00122",
> > I'd want to get out "00","1", "22".
>
> > is there an easy way to do this?  i can't simply split x on a, b, or c
> > because these might
> > not be contained in x.  I want to avoid doing something inefficient
> > like looking at all substrings
> > of my_string etc.
>
> > i wouldn't mind using regular expressions for this but i cannot think
> > of an easy regular
> > expression for this problem.  I looked at the string module in the
> > library but did not see
> > anything that seemd related but i might have missed it.
>
> I'm not sure I understand your question exactly.  You seem to imply
> that the order of the substrings of x is consistent.  If that's the
> case, this ought to help:
>
> >>> import re
> >>> x = "00122"
> >>> m = re.match(r"(0*)(1*)(2*)", x)
> >>> m.groups()
>
> ('00', '1', '22')>>> y = "00111"
> >>> m = re.match(r"(0*)(1*)(2*)", y)
> >>> m.groups()
>
> ('00', '111', '')
>
> You'll have to filter out the empty groups for yourself, but that's
> no great problem.
>
> --
> Rhodri James *-* Wildebeest Herder to the Masses

The order of the substrings is consistent but what if it's not 0, 1, 2
but a more complicated string? e.g.

a = 1030405, b = 1babcf, c = fUUIUP

then the substring x might be 4051ba, in which case using a regexp
with (1*) will not work since both a and b substrings begin with the
character 1.

your solution would work if that weren't a possibility, so what you wrote
is definitely the kind of solution i am looking for. i am just not
sure how to solve it in the general case where the substrings might be
similar to each other (but not so similar that you can't tell
where the substring came from).

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: efficiently splitting up strings based on substrings

2009-09-05 Thread per
On Sep 5, 7:07 pm, "Rhodri James"  wrote:
> On Sat, 05 Sep 2009 23:54:08 +0100, per  wrote:
> > On Sep 5, 6:42 pm, "Rhodri James"  wrote:
> >> On Sat, 05 Sep 2009 22:54:41 +0100, per  wrote:
> >> > I'm trying to efficiently "split" strings based on what substrings
> >> > they are made up of.
> >> > i have a set of strings that are comprised of known substrings.
> >> > For example, a, b, and c are substrings that are not identical to each
> >> > other, e.g.:
> >> > a = "0" * 5
> >> > b = "1" * 5
> >> > c = "2" * 5
>
> >> > Then my_string might be:
>
> >> > my_string = a + b + c
>
> >> > i am looking for an efficient way to solve the following problem.
> >> > suppose i have a short
> >> > string x that is a substring of my_string.  I want to "split" the
> >> > string x into blocks based on
> >> > what substrings (i.e. a, b, or c) chunks of s fall into.
>
> >> > to illustrate this, suppose x = "00111". Then I can detect where x
> >> > starts in my_string
> >> > using my_string.find(x).  But I don't know how to partition x into
> >> > blocks depending
> >> > on the substrings.  What I want to get out in this case is: "00",
> >> > "111".  If x were "00122",
> >> > I'd want to get out "00","1", "22".
>
> >> > is there an easy way to do this?  i can't simply split x on a, b, or c
> >> > because these might
> >> > not be contained in x.  I want to avoid doing something inefficient
> >> > like looking at all substrings
> >> > of my_string etc.
>
> >> > i wouldn't mind using regular expressions for this but i cannot think
> >> > of an easy regular
> >> > expression for this problem.  I looked at the string module in the
> >> > library but did not see
> >> > anything that seemd related but i might have missed it.
>
> >> I'm not sure I understand your question exactly.  You seem to imply
> >> that the order of the substrings of x is consistent.  If that's the
> >> case, this ought to help:
>
> >> >>> import re
> >> >>> x = "00122"
> >> >>> m = re.match(r"(0*)(1*)(2*)", x)
> >> >>> m.groups()
>
> >> ('00', '1', '22')>>> y = "00111"
> >> >>> m = re.match(r"(0*)(1*)(2*)", y)
> >> >>> m.groups()
>
> >> ('00', '111', '')
>
> >> You'll have to filter out the empty groups for yourself, but that's
> >> no great problem.
>
> > The order of the substrings is consistent but what if it's not 0, 1, 2
> > but a more complicated string? e.g.
>
> > a = 1030405, b = 1babcf, c = fUUIUP
>
> > then the substring x might be 4051ba, in which case using a regexp
> > with (1*) will not work since both a and b substrings begin with the
> > character 1.
>
> Right.  This looks approximately nothing like what I thought your
> problem was.  Would I be right in thinking that you want to match
> substrings of your potential "substrings" against the string x?
>
> I'm sufficiently confused that I think I'd like to see what your
> use case actually is before I make more of a fool of myself.
>
> --
> Rhodri James *-* Wildebeest Herder to the Masses

it's exactly the same problem, except there are no constraints on the
strings.  so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.

so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP

given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to b. in this case, there are no chunks of c.  if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.

i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.
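
One way to attack it (a sketch, assuming x occurs in the concatenation of
the known substrings and that the first occurrence found is the right one):
locate x in the concatenation, then cut x wherever a boundary between the
substrings falls inside the matched region.

def split_by_blocks(x, blocks):
    full = ''.join(blocks)
    start = full.find(x)
    if start == -1:
        return None
    end = start + len(x)
    pieces = []
    offset = 0
    for block in blocks:
        lo, hi = offset, offset + len(block)
        cut_lo, cut_hi = max(start, lo), min(end, hi)   # overlap with this block
        if cut_lo < cut_hi:
            pieces.append(x[cut_lo - start:cut_hi - start])
        offset = hi
    return pieces

a, b, c = '1030405', '1babcf', 'fUUIUP'
print(split_by_blocks('4051ba', [a, b, c]))        # ['405', '1ba']
print(split_by_blocks('4051babcffUU', [a, b, c]))  # ['405', '1babcf', 'fUU']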
-- 
http://mail.python.org/mailman/listinfo/python-list


serial port server cnhd38

2005-02-22 Thread per . bergstrom
To whom it may concern,
The serial port server 'cnhd38' has been terminated (on whose
initiative, I don't know).
It affects the users of the (at least) following nodes:
cnhd36, cnhd44, cnhd45, cnhd46, cnhd47.
The new terminal server to use is called 'msp-t01'. The port
numbers that are of interest for the nodes mentioned above are
as follows:
port 17: this port is shared between:
  cnhd44/etm4 serial port (via riscwatch), currently connected here.
  cnhd36/console port
port 18: this port goes to cnhd44/console port
port 19: this port goes to cnhd45/console port
port 20: this port goes to cnhd47/console port
port 21: this port goes to cnhd46/console port
To connect to a port, just enter the following command:
telnet msp-t01 <nn><p>
... an extra <return> should give you the prompt.
<nn> is always 20
<p> is the port number...
example, connect to cnhd47/console port:
telnet msp-t01 2020
br
/Per
--
http://mail.python.org/mailman/listinfo/python-list


Re: Program eating memory, but only on one machine?

2007-01-22 Thread Per B.Sederberg

Wolfgang Draxinger  darkstargames.de> writes:
> 
> > So, does anyone have any suggestions for how I can debug this
> > problem?
> 
> Have a look at the version numbers of the GCC used. Probably
> something in your C code fails if it interacts with GCC 3.x.x.
> It's hardly Python eating memory, this is probably your C
> module. GC won't help here, since then you must add this into
> your C module.
> 
> >  If my program ate up memory on all machines, then I would know
> > where to start and would blame some horrible programming on my
> > end. This just seems like a less straightforward problem.
> 
> GCC 3.x.x brings other runtime libs, than GCC 4.x.x, I would
> check into that direction.
> 

Thank you for the suggestions.  Since my C module is such a small part of the
simulations, I can just comment out the call to that module completely (though I
am still loading it) and fill in what the results would have been with random
values.  Sadly, the program still eats up memory on our cluster.

Still, it could be something related to compiling Python with the older GCC.

I'll see if I can make a really small example program that eats up memory on our
cluster.  That way we'll have something easy to work with.

Thanks,
Per



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Program eating memory, but only on one machine? (Solved, sort of)

2007-01-22 Thread Per B.Sederberg
Per B.Sederberg  princeton.edu> writes:

> I'll see if I can make a really small example program that eats up memory on
> our cluster.  That way we'll have something easy to work with.

Now this is weird.  I figured out the bug and it turned out that every time you
call numpy.setmember1d in the latest stable release of numpy it was using up a
ton of memory and never releasing it.

I replaced every instance of setmember1d with my own method below and I have
zero increase in memory.  It's not the most efficient of code, but it gets the
job done...


def ismember(a, b):
    # 'zeros' is numpy.zeros; a is a numpy array, b any iterable of values
    ainb = zeros(len(a), dtype=bool)
    for item in b:
        ainb = ainb | (a == item)
    return ainb

I'll now go post this problem on the numpy forums.

Best,
Per




-- 
http://mail.python.org/mailman/listinfo/python-list


efficient interval containment lookup

2009-01-12 Thread Per Freem
hello,

suppose I have two lists of intervals, one significantly larger than
the other.
For example listA = [(10, 30), (5, 25), (100, 200), ...] might contain
thousands
of elements while listB (of the same form) might contain hundreds of
thousands
or millions of elements.
I want to count how many intervals in listB are contained within each
interval in listA. For example, if listA = [(10, 30), (600, 800)] and listB =
[(20, 25), (12, 18)] is the input, then the output should be that (10,
30) has 2 intervals from listB contained within it, while (600, 800)
has 0. (Elements of listB can be contained within many intervals in
listA, not just one.)

What is an efficient way to do this?  One simple way is:

for a_range in listA:
    for b_range in listB:
        if is_within(b_range, a_range):
            pass  # accumulate a counter here

where is_within simply checks if the first argument is within the
second.

I'm not sure if it's more efficient to have the iteration over listA
be on the outside or listB.  But perhaps there's a way to index this
that makes things more efficient?  I.e. a smart way of indexing listA
such that I can instantly get all of its elements that are within some
element
of listB, maybe?  Something like a hash, where this look up can be
close to constant time rather than an iteration over all lists... if
there's any built-in library functions that can help in this it would
be great.

any suggestions on this would be awesome. thank you.
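
One reasonably simple speedup (a sketch, not a full interval index): sort
listB by interval start, then for each (c, d) in listA use bisect to restrict
attention to the intervals whose start lies in [c, d], and only check the
endpoints of those:

from bisect import bisect_left, bisect_right

def count_contained(listA, listB):
    listB_sorted = sorted(listB)
    starts = [s for s, _ in listB_sorted]
    counts = []
    for c, d in listA:
        lo = bisect_left(starts, c)        # first interval with start >= c
        hi = bisect_right(starts, d)       # first interval with start > d
        counts.append(sum(1 for s, e in listB_sorted[lo:hi] if e <= d))
    return counts

listA = [(10, 30), (600, 800)]
listB = [(20, 25), (12, 18)]
print(count_contained(listA, listB))       # [2, 0]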
--
http://mail.python.org/mailman/listinfo/python-list


Re: efficient interval containment lookup

2009-01-12 Thread Per Freem
thanks for your replies -- a few clarifications and questions. the
is_within operation is containment, i.e. (a,b) is within (c,d) iff a
>= c and b <= d. Note that I am not looking for intervals that
overlap... this is why interval trees seem to me to not be relevant,
as the overlapping interval problem is way harder than what I am
trying to do. Please correct me if I'm wrong on this...

Scott Daniels, I was hoping you could elaborate on your comment about
bisect. I am trying to use it as follows: I try to grid my space
(since my intervals have an upper and lower bound) into segments (e.g.
of 100) and then I take these "bins" and put them into a bisect list,
so that it is sorted. Then when a new interval comes in, I try to
place it within one of those bins.  But this is getting messy: I don't
know if I should place it there by its beginning number or end
number.  Also, if I have an interval that overlaps my boundaries --
i.e. (900, 1010) when my first interval is (0, 1000), I may miss some
items from listB when i make my count.  Is there an elegant solution
to this?  Gridding like you said seemed straightforward but now it
seems complicated...

I'd like to add that this is *not* a homework problem, by the way.

On Jan 12, 4:05 pm, Robert Kern  wrote:
> [Apologies for piggybacking, but I think GMane had a hiccup today and missed 
> the
> original post]
>
> [Somebody wrote]:
>
> >> suppose I have two lists of intervals, one significantly larger than
> >> the other.
> >> For example listA = [(10, 30), (5, 25), (100, 200), ...] might contain
> >> thousands
> >> of elements while listB (of the same form) might contain hundreds of
> >> thousands
> >> or millions of elements.
> >> I want to count how many intervals in listB are contained within every
> >> listA. For example, if listA = [(10, 30), (600, 800)] and listB =
> >> [(20, 25), (12, 18)] is the input, then the output should be that (10,
> >> 30) has 2 intervals from listB contained within it, while (600, 800)
> >> has 0. (Elements of listB can be contained within many intervals in
> >> listA, not just one.)
>
> Interval trees.
>
> http://en.wikipedia.org/wiki/Interval_tree
>
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless enigma
>   that is made terrible by our own mad attempt to interpret it as though it 
> had
>   an underlying truth."
>    -- Umberto Eco

--
http://mail.python.org/mailman/listinfo/python-list


Re: efficient interval containment lookup

2009-01-12 Thread Per Freem
On Jan 12, 10:58 pm, Steven D'Aprano
 wrote:
> On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote:
> > thanks for your replies -- a few clarifications and questions. the
> > is_within operation is containment, i.e. (a,b) is within (c,d) iff a
> >>= c and b <= d. Note that I am not looking for intervals that
> > overlap... this is why interval trees seem to me to not be relevant, as
> > the overlapping interval problem is way harder than what I am trying to
> > do. Please correct me if I'm wrong on this...
>
> To test for contained intervals:
> a >= c and b <= d
>
> To test for overlapping intervals:
>
> not (b < c or a > d)
>
> Not exactly what I would call "way harder".
>
> --
> Steven

hi Steven,

i found an implementation (which is exactly how i'd write it based on
the description) here: 
http://hackmap.blogspot.com/2008/11/python-interval-tree.html

when i use this however, it comes out either significantly slower or
equal to a naive search. my naive search just iterates through a
smallish list of intervals and for each one says whether they overlap
with each of a large set of intervals.

here is the exact code i used to make the comparison, plus the code at
the link i have above:

class Interval():
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

import random
import time
num_ints = 3
init_intervals = []
for n in range(0, num_ints):
    start = int(round(random.random()*1000))
    end = start + int(round(random.random()*500+1))
    init_intervals.append(Interval(start, end))
num_ranges = 900
ranges = []
for n in range(0, num_ranges):
    start = int(round(random.random()*1000))
    end = start + int(round(random.random()*500+1))
    ranges.append((start, end))
#print init_intervals
tree = IntervalTree(init_intervals)
t1 = time.time()
for r in ranges:
    tree.find(r[0], r[1])
t2 = time.time()
print "interval tree: %.3f" %((t2-t1)*1000.0)
t1 = time.time()
for r in ranges:
    naive_find(init_intervals, r[0], r[1])
t2 = time.time()
print "brute force: %.3f" %((t2-t1)*1000.0)

on one run, i get:
interval tree: 8584.682
brute force: 8201.644

is there anything wrong with this implementation? it seems very right
to me but i am no expert. any help on this would be really helpful.
--
http://mail.python.org/mailman/listinfo/python-list


Re: efficient interval containment lookup

2009-01-12 Thread Per Freem
i forgot to add, my naive_find is:

def naive_find(intervals, start, stop):
    results = []
    for interval in intervals:
        if interval.start >= start and interval.stop <= stop:
            results.append(interval)
    return results

On Jan 12, 11:55 pm, Per Freem  wrote:
> On Jan 12, 10:58 pm, Steven D'Aprano
>
>
>
>  wrote:
> > On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote:
> > > thanks for your replies -- a few clarifications and questions. the
> > > is_within operation is containment, i.e. (a,b) is within (c,d) iff a
> > >>= c and b <= d. Note that I am not looking for intervals that
> > > overlap... this is why interval trees seem to me to not be relevant, as
> > > the overlapping interval problem is way harder than what I am trying to
> > > do. Please correct me if I'm wrong on this...
>
> > To test for contained intervals:
> > a >= c and b <= d
>
> > To test for overlapping intervals:
>
> > not (b < c or a > d)
>
> > Not exactly what I would call "way harder".
>
> > --
> > Steven
>
> hi Steven,
>
> i found an implementation (which is exactly how i'd write it based on
> the description) 
> here:http://hackmap.blogspot.com/2008/11/python-interval-tree.html
>
> when i use this however, it comes out either significantly slower or
> equal to a naive search. my naive search just iterates through a
> smallish list of intervals and for each one says whether they overlap
> with each of a large set of intervals.
>
> here is the exact code i used to make the comparison, plus the code at
> the link i have above:
>
> class Interval():
>     def __init__(self, start, stop):
>         self.start = start
>         self.stop = stop
>
> import random
> import time
> num_ints = 3
> init_intervals = []
> for n in range(0,
> num_ints):
>     start = int(round(random.random()
> *1000))
>     end = start + int(round(random.random()*500+1))
>     init_intervals.append(Interval(start, end))
> num_ranges = 900
> ranges = []
> for n in range(0, num_ranges):
>   start = int(round(random.random()
> *1000))
>   end = start + int(round(random.random()*500+1))
>   ranges.append((start, end))
> #print init_intervals
> tree = IntervalTree(init_intervals)
> t1 = time.time()
> for r in ranges:
>   tree.find(r[0], r[1])
> t2 = time.time()
> print "interval tree: %.3f" %((t2-t1)*1000.0)
> t1 = time.time()
> for r in ranges:
>   naive_find(init_intervals, r[0], r[1])
> t2 = time.time()
> print "brute force: %.3f" %((t2-t1)*1000.0)
>
> on one run, i get:
> interval tree: 8584.682
> brute force: 8201.644
>
> is there anything wrong with this implementation? it seems very right
> to me but i am no expert. any help on this would be relly helpful.

--
http://mail.python.org/mailman/listinfo/python-list


Re: efficient interval containment lookup

2009-01-12 Thread Per Freem
hi brent, thanks very much for your informative reply -- didn't
realize this about the size of the interval.

thanks for the bx-python link.  could you (or someone else) explain
why the size of the interval makes such a big difference? i don't
understand why it affects efficiency so much...

thanks.

On Jan 13, 12:24 am, brent  wrote:
> On Jan 12, 8:55 pm, Per Freem  wrote:
>
>
>
> > On Jan 12, 10:58 pm, Steven D'Aprano
>
> >  wrote:
> > > On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote:
> > > > thanks for your replies -- a few clarifications and questions. the
> > > > is_within operation is containment, i.e. (a,b) is within (c,d) iff a
> > > >>= c and b <= d. Note that I am not looking for intervals that
> > > > overlap... this is why interval trees seem to me to not be relevant, as
> > > > the overlapping interval problem is way harder than what I am trying to
> > > > do. Please correct me if I'm wrong on this...
>
> > > To test for contained intervals:
> > > a >= c and b <= d
>
> > > To test for overlapping intervals:
>
> > > not (b < c or a > d)
>
> > > Not exactly what I would call "way harder".
>
> > > --
> > > Steven
>
> > hi Steven,
>
> > i found an implementation (which is exactly how i'd write it based on
> > the description) 
> > here:http://hackmap.blogspot.com/2008/11/python-interval-tree.html
>
> > when i use this however, it comes out either significantly slower or
> > equal to a naive search. my naive search just iterates through a
> > smallish list of intervals and for each one says whether they overlap
> > with each of a large set of intervals.
>
> > here is the exact code i used to make the comparison, plus the code at
> > the link i have above:
>
> > class Interval():
> >     def __init__(self, start, stop):
> >         self.start = start
> >         self.stop = stop
>
> > import random
> > import time
> > num_ints = 3
> > init_intervals = []
> > for n in range(0,
> > num_ints):
> >     start = int(round(random.random()
> > *1000))
> >     end = start + int(round(random.random()*500+1))
> >     init_intervals.append(Interval(start, end))
> > num_ranges = 900
> > ranges = []
> > for n in range(0, num_ranges):
> >   start = int(round(random.random()
> > *1000))
> >   end = start + int(round(random.random()*500+1))
> >   ranges.append((start, end))
> > #print init_intervals
> > tree = IntervalTree(init_intervals)
> > t1 = time.time()
> > for r in ranges:
> >   tree.find(r[0], r[1])
> > t2 = time.time()
> > print "interval tree: %.3f" %((t2-t1)*1000.0)
> > t1 = time.time()
> > for r in ranges:
> >   naive_find(init_intervals, r[0], r[1])
> > t2 = time.time()
> > print "brute force: %.3f" %((t2-t1)*1000.0)
>
> > on one run, i get:
> > interval tree: 8584.682
> > brute force: 8201.644
>
> > is there anything wrong with this implementation? it seems very right
> > to me but i am no expert. any help on this would be relly helpful.
>
> hi, the tree is inefficient when the interval is large. as the size of
> the interval shrinks to much less than the expanse of the tree, the
> tree will be faster. changing 500 to 50 in both cases in your script,
> i get:
> interval tree: 3233.404
> brute force: 9807.787
>
> so the tree will work for limited cases. but it's quite simple. check
> the tree in 
> bx-python:http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/opera...
> for a more robust implementation.
> -brentp


--
http://mail.python.org/mailman/listinfo/python-list


optimizing large dictionaries

2009-01-15 Thread Per Freem
hello

i have an optimization question about python. i am iterating through
a file and counting the number of repeated elements. the file has on
the order of tens of millions of elements...

i create a dictionary that maps elements of the file that i want to
count to their number of occurrences. so i iterate through the file and
for each line i extract the element (a simple text operation) and see if
it has an entry in the dict:

for line in file:
  try:
elt = MyClass(line)# extract elt from line...
my_dict[elt] += 1
  except KeyError:
my_dict[elt] = 1

i am using try/except since it is supposedly faster (though i am not
sure
about this? is this really true in Python 2.5?).

the only 'twist' is that my elt is an instance of a class (MyClass)
with 3 fields, all numeric. the class is hashable, and so my_dict[elt]
works well.
the __repr__ and __hash__ methods of my class simply use the str()
representation of self, while __str__ just concatenates the numeric
fields into a single string:

class MyClass:

  def __str__(self):
return "%s-%s-%s" %(self.field1, self.field2, self.field3)

  def __repr__(self):
return str(self)

  def __hash__(self):
return hash(str(self))


is there anything that can be done to speed up this simple code? right
now it is taking well over 15 minutes to process, on a 3 Ghz machine
with lots of RAM (though this is all taking CPU power, not RAM at this
point.)

any general advice on how to optimize large dicts would be great too

thanks for your help.
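
for what it's worth, collections.defaultdict removes the try/except branch
entirely; a minimal sketch of the counting loop (the filename is made up):

from collections import defaultdict

counts = defaultdict(int)              # missing keys start at 0
for line in open('data.txt'):
    elt = MyClass(line)                # same extraction as above
    counts[elt] += 1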
--
http://mail.python.org/mailman/listinfo/python-list


Re: optimizing large dictionaries

2009-01-15 Thread Per Freem
thanks to everyone for the excellent suggestions. a few follow up q's:

1] is try-except really slower? my dict actually has two layers, so
my_dict[aKey][bKey]. the aKeys are very few (less than 100), whereas
the bKeys are the ones that are in the millions.  so in that case,
doing a try-except on aKey should be very efficient, since often it
will not fail, whereas if I do "if aKey in my_dict", that statement
will get executed for each aKey. can someone say definitively whether
try-except is faster or not? My benchmarks aren't conclusive and i
hear it both ways from several people (though the majority thinks
try-except is faster).

2] is there an easy way to have nested defaultdicts? i.e. i want to say
that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact
that my_dict is a dictionary whose values are dictionaries that map to
ints. but that syntax is not valid.

3] more importantly, is there likely to be a big improvement for
splitting up one big dictionary into several smaller ones? if so, is
there a straight forward elegant way to implement this? the way i am
thinking is to just fix a number of dicts and populate them with
elements. then during retrieval, try the first dict, if that fails,
try the second, if not the third, etc... but i can imagine how that's
more likely to lead to bugs / debugging, given the way my code is set up,
so i am wondering whether it is really worth it.
if it can lead to a factor of 2 difference, i will definitely
implement it -- does anyone have experience with this?
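
As an aside on question 2: defaultdict takes any zero-argument callable as
its factory, so wrapping the inner defaultdict in a lambda works; a minimal
sketch:

from collections import defaultdict

my_dict = defaultdict(lambda: defaultdict(int))
my_dict['aKey']['bKey'] += 1           # both levels spring into existence
print(my_dict['aKey']['bKey'])         # 1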

On Jan 15, 5:58 pm, Steven D'Aprano  wrote:
> On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote:
> >> is there anything that can be done to speed up this simply code? right
> >> now it is taking well over 15 minutes to process, on a 3 Ghz machine
> >> with lots of RAM (though this is all taking CPU power, not RAM at this
> >> point.)
>
> > class MyClass(object):
> >     # a new style class with slots saves some memory
> >     __slots__ = ("field1", "field2", "field2")
>
> I was curious whether using slots would speed up attribute access.
>
> >>> class Parrot(object):
>
> ...     def __init__(self, a, b, c):
> ...             self.a = a
> ...             self.b = b
> ...             self.c = c
> ...>>> class SlottedParrot(object):
>
> ...     __slots__ = 'a', 'b', 'c'
> ...     def __init__(self, a, b, c):
> ...             self.a = a
> ...             self.b = b
> ...             self.c = c
> ...
>
> >>> p = Parrot(23, "something", [1, 2, 3])
> >>> sp = SlottedParrot(23, "something", [1, 2, 3])
>
> >>> from timeit import Timer
> >>> setup = "from __main__ import p, sp"
> >>> t1 = Timer('p.a, p.b, p.c', setup)
> >>> t2 = Timer('sp.a, sp.b, sp.c', setup)
> >>> min(t1.repeat())
> 0.83308887481689453
> >>> min(t2.repeat())
>
> 0.62758088111877441
>
> That's not a bad improvement. I knew that __slots__ was designed to
> reduce memory consumption, but I didn't realise they were faster as well.
>
> --
> Steven

--
http://mail.python.org/mailman/listinfo/python-list


test

2004-12-23 Thread Per Erik Stendahl
sdfdsafasd
--
http://mail.python.org/mailman/listinfo/python-list


Program eating memory, but only on one machine?

2007-01-22 Thread Per B. Sederberg
Hi Everybody:

I'm having a difficult time figuring out a memory use problem.  I
have a python program that makes use of numpy and also calls a small C
module I wrote because part of the simulation needed to loop and I got
a massive speedup by putting that loop in C.  I'm basically
manipulating a bunch of matrices, so nothing too fancy.

That aside, when the simulation runs, it typically uses a relatively
small amount of memory (about 1.5% of my 4GB of RAM on my linux
desktop) and this never increases.  It can run for days without
increasing beyond this, running many many parameter set iterations.
This is what happens both on my Ubuntu Linux machine with the
following Python specs:

Python 2.4.4c1 (#2, Oct 11 2006, 20:00:03)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.version.version
'1.0rc1'

and also on my Apple MacBook with the following Python specs:

Python 2.4.3 (#1, Apr  7 2006, 10:54:33)
[GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.version.version
'1.0.1.dev3435'
>>>


Well, that is the case on two of my test machines, but not on the one
machine that I really wish would work, my lab's cluster, which would
give me a 20-fold increase in the number of processes I could run.  On
that machine, each process is using 2GB of RAM after about 1 hour (and
the cluster MOM eventually kills them).  I can watch the process eat
RAM at each iteration and never relinquish it.  Here's the Python spec
of the cluster:

Python 2.4.4 (#1, Jan 21 2007, 12:09:48)
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-49)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.version.version
'1.0.1'

It also showed the same issue with the April 2006 2.4.3 release of python.

I have tried using the gc module to force garbage collection after
each iteration, but no change.  I've done many newsgroup/google
searches looking for known issues, but none found.  The only major
difference I can see is that our cluster is stuck on a really old
version of gcc with the RedHat Enterprise that's on there, but I found
no suggestions of memory issues online.

So, does anyone have any suggestions for how I can debug this problem?
 If my program ate up memory on all machines, then I would know where
to start and would blame some horrible programming on my end.  This
just seems like a less straightforward problem.
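
One low-tech way to see where the growth happens (a sketch using only the
standard library; on Linux, ru_maxrss is reported in kilobytes):

import resource

def log_memory(tag):
    # print the peak resident set size so far; call once per iteration
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('%s: peak RSS %d kB' % (tag, peak))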

Thanks for any help,
Per
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: listdir reports [Error 1006] The volume for a file has been externally altered so that the opened file is no longer valid

2009-01-08 Thread Per Olav Kroka
FYI: the '/*.*' is part of the error message returned. 

-Original Message-
From: ch...@rebertia.com [mailto:ch...@rebertia.com] On Behalf Of Chris
Rebert
Sent: Wednesday, January 07, 2009 6:40 PM
To: Per Olav Kroka
Cc: python-list@python.org
Subject: Re: listdir reports [Error 1006] The volume for a file has been
externally altered so that the opened file is no longer valid

> PS: Why does the listdir() function add '*.*' to the path?

Don't know what you're talking about. It doesn't do any globbing or add
"*.*" to the path. Its exclusive purpose is to list the contents of a
directory, so /in a sense/ it does add "*.*", but then not adding "*.*"
would make the function completely useless given its purpose.

> PS2: Why does the listdir() function add '/*.*' to the path on windows

> and not '\\*.*' ?

You can use either directory separator (\ or /) with the Python APIs on
Windows. r"c:\WINDOWS\" works just as well as "c:/WINDOWS/".

Cheers,
Chris

--
Follow the path of the Iguana...
http://rebertia.com
--
http://mail.python.org/mailman/listinfo/python-list