(question) How to use python get access to google search without query quota limit
I am doing a natural language processing project for academic use. I think Google's rich retrieval information and query segmentation might be of help. I downloaded the Google API, but there is a query limit (1000/day). How can I write Python code to simulate browser-like activity so I can submit more than 10k queries in one day? Applying for more than 10 licence keys and rotating them whenever a query-quota exception is raised is not a neat idea... -- http://mail.python.org/mailman/listinfo/python-list
Re: (question) How to use python get access to google search without query quota limit
Yeah, thanks Am, I can be considered an advanced Google user, presumably... but I am not an advanced programmer yet. If everyone can generate an unlimited number of queries, soon the user query data, which I believe is Google's biggest advantage, will be in chaos. Can they simply ignore some queries from a certain licence key, or something like that, so that they can keep their user-query statistics normal and yet give cranky queriers a reasonable response? -- http://mail.python.org/mailman/listinfo/python-list
setting PYTHONPATH to override system wide site-packages
hi all, i recently installed a new version of a package using python setup.py install --prefix=/my/homedir on a system where i don't have root access. the old package still resides in /usr/lib/python2.5/site-packages/ and i cannot erase it. i set my python path as follows in ~/.cshrc setenv PYTHONPATH /path/to/newpackage but whenever i go to python and import the module, the version in site-packages is loaded. how can i override this setting and make it so python loads the version of the package that's in my home dir? thanks. -- http://mail.python.org/mailman/listinfo/python-list
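A minimal sketch of forcing the home-directory copy to win and checking which file actually got imported; the lib path below is a guess at the usual --prefix layout, and mymodule stands in for the real package name:

import sys

# Put the home-directory install ahead of the system site-packages.
# Adjust the path to wherever setup.py actually placed the package.
sys.path.insert(0, '/my/homedir/lib/python2.5/site-packages')

import mymodule
print(mymodule.__file__)      # shows which copy was really imported
print(mymodule.__version__)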
Re: setting PYTHONPATH to override system wide site-packages
On Feb 28, 11:24 pm, Carl Banks wrote: > On Feb 28, 7:30 pm, per wrote: > > > hi all, > > > i recently installed a new version of a package using python setup.py > > install --prefix=/my/homedir on a system where i don't have root > > access. the old package still resides in /usr/lib/python2.5/site- > > packages/ and i cannot erase it. > > > i set my python path as follows in ~/.cshrc > > > setenv PYTHONPATH /path/to/newpackage > > > but whenever i go to python and import the module, the version in site- > > packages is loaded. how can i override this setting and make it so > > python loads the version of the package that's in my home dir? > > What happens when you run the command "print sys.path" from the Python > prompt? /path/to/newpackage should be the second item, and shoud be > listed in front of the site-packages dir. > > What happens when you run "print os.eviron['PYTHONPATH']" at the > Python interpreter? It's possible that the sysadmin installed a > script that removes PYTHONPATH environment variable before invoking > Python. What happens when you type "which python" at the csh prompt? > > What happens when you type "ls /path/to/newpackage" at your csh > prompt? Is the module you're trying to import there? > > You approach should work. These are just suggestions on how to > diagnose the problem; we can't really help you figure out what's wrong > without more information. > > Carl Banks hi, i am setting it programmatically now, using: import sys sys.path = [] sys.path now looks exactly like what it looked like before, except the second element is my directory. yet when i do import mymodule print mymodule.__version__ i still get the old version... any other ideas? -- http://mail.python.org/mailman/listinfo/python-list
Re: setting PYTHONPATH to override system wide site-packages
On Feb 28, 11:53 pm, per wrote: > On Feb 28, 11:24 pm, Carl Banks wrote: > > > > > On Feb 28, 7:30 pm, per wrote: > > > > hi all, > > > > i recently installed a new version of a package using python setup.py > > > install --prefix=/my/homedir on a system where i don't have root > > > access. the old package still resides in /usr/lib/python2.5/site- > > > packages/ and i cannot erase it. > > > > i set my python path as follows in ~/.cshrc > > > > setenv PYTHONPATH /path/to/newpackage > > > > but whenever i go to python and import the module, the version in site- > > > packages is loaded. how can i override this setting and make it so > > > python loads the version of the package that's in my home dir? > > > What happens when you run the command "print sys.path" from the Python > > prompt? /path/to/newpackage should be the second item, and shoud be > > listed in front of the site-packages dir. > > > What happens when you run "print os.eviron['PYTHONPATH']" at the > > Python interpreter? It's possible that the sysadmin installed a > > script that removes PYTHONPATH environment variable before invoking > > Python. What happens when you type "which python" at the csh prompt? > > > What happens when you type "ls /path/to/newpackage" at your csh > > prompt? Is the module you're trying to import there? > > > You approach should work. These are just suggestions on how to > > diagnose the problem; we can't really help you figure out what's wrong > > without more information. > > > Carl Banks > > hi, > > i am setting it programmatically now, using: > > import sys > sys.path = [] > > sys.path now looks exactly like what it looked like before, except the > second element is my directory. yet when i do > > import mymodule > print mymodule.__version__ > > i still get the old version... > > any other ideas? in case it helps, it gives me this warning when i try to import the module /usr/lib64/python2.5/site-packages/pytz/__init__.py:29: UserWarning: Module dateutil was already imported from /usr/lib64/python2.5/site- packages/dateutil/__init__.pyc, but /usr/lib/python2.5/site-packages is being added to sys.path from pkg_resources import resource_stream -- http://mail.python.org/mailman/listinfo/python-list
speeding up reading files (possibly with cython)
hi all, i have a program that essentially loops through a text file that's about 800 MB in size containing tab-separated data... my program parses this file and stores its fields in a dictionary of lists: for line in file: split_values = line.strip().split('\t') # do stuff with split_values currently, this is very slow in python, even if all i do is break up each line using split() and store its values in a dictionary, indexed by one of the tab-separated values in the file. is this just an overhead of python that's inevitable? do you guys think that switching to cython might speed this up, perhaps by optimizing the main for loop? or is this not a viable option? thank you. -- http://mail.python.org/mailman/listinfo/python-list
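For reference, a minimal plain-Python sketch using collections.defaultdict (available since 2.5), which at least removes the per-key membership handling; how much Cython would add beyond this depends on what "do stuff" really is:

from collections import defaultdict

def load_table(path, key_col=0):
    table = defaultdict(list)                   # no "if key in dict" test needed
    f = open(path)
    for line in f:
        fields = line.rstrip('\n').split('\t')  # only strip the trailing newline
        table[fields[key_col]].append(fields)
    f.close()
    return table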
parsing tab separated data efficiently into numpy/pylab arrays
hi all, what's the most efficient / preferred python way of parsing tab-separated data into arrays? for example, if i have a file containing two columns, one corresponding to names, the other to numbers: col1\t col2 joe\t 12.3 jane\t 155.0 i'd like to parse it into an array() such that i can do mydata[:, 0] and mydata[:, 1] to easily access all the columns. right now i can iterate through the file, parse it manually using split('\t'), construct a list out of it, then convert it to arrays. but there must be a better way? also, my first column is just a name, and so it is variable in length -- is there still a way to store it as an array so i can access mydata[:, 0] to get all the names (as a list)? thank you. -- http://mail.python.org/mailman/listinfo/python-list
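One option is numpy.genfromtxt, which builds a structured array so a text column and a numeric column can live side by side and be pulled out by field name. A sketch, assuming the file has a header line with column names 'name' and 'value' (placeholders for whatever the real headers are):

import numpy as np

# dtype=None lets genfromtxt guess a type per column; names=True takes
# the field names from the header line.
data = np.genfromtxt('mydata.txt', delimiter='\t', dtype=None, names=True)

names = data['name']      # all entries of the string column
values = data['value']    # all entries of the numeric column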
loading program's global variables in ipython
hi all, i have a file that declares some global variables, e.g. myglobal1 = 'string' myglobal2 = 5 and then some functions. i run it using ipython as follows: [1] %run myfile.py i notice then that myglobal1 and myglobal2 are not imported into python's interactive namespace. i'd like them to be -- how can i do this? (note my file does not contain an if __name__ == '__main__' block.) thanks. -- http://mail.python.org/mailman/listinfo/python-list
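Two things worth trying, depending on the IPython version in use (the magic command is shown as a comment because it is not plain Python):

# Inside IPython, run the file in the interactive namespace itself:
#   %run -i myfile.py
# A plain-Python fallback with a similar effect (Python 2):
execfile('myfile.py')
print(myglobal1, myglobal2)   # the module-level names are now in scope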
splitting a large dictionary into smaller ones
hi all, i have a very large dictionary object that is built from a text file that is about 800 MB -- it contains several million keys. ideally i would like to pickle this object so that i wouldn't have to parse this large file to compute the dictionary every time i run my program. however currently the pickled file is over 300 MB and takes a very long time to write to disk - even longer than recomputing the dictionary from scratch. i would like to split the dictionary into smaller ones, containing only hundreds of thousands of keys, and then try to pickle them. is there a way to easily do this? i.e. is there an easy way to make a wrapper for this such that i can access this dictionary as just one object, but underneath it's split into several? so that i can write my_dict[k] and get a value, or set my_dict[m] to some value without knowing which sub-dictionary it's in. if there aren't known ways to do this, i would greatly appreciate any advice/examples on how to write this data structure from scratch, reusing as much of the dict() class as possible. thanks. -- http://mail.python.org/mailman/listinfo/python-list
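A minimal sketch of such a wrapper: keys are routed to one of several ordinary dicts by hash, so lookups stay plain dict lookups, and each shard can be pickled to its own (much smaller) file:

try:
    import cPickle as pickle      # faster pickler on Python 2
except ImportError:
    import pickle

class ShardedDict(object):
    """Dict-like wrapper spreading keys over several plain dicts (a sketch)."""

    def __init__(self, n_shards=16):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def __getitem__(self, key):
        return self._shard(key)[key]

    def __setitem__(self, key, value):
        self._shard(key)[key] = value

    def __contains__(self, key):
        return key in self._shard(key)

    def dump(self, prefix):
        # one pickle file per shard, so each write stays comparatively small
        for i, shard in enumerate(self.shards):
            f = open('%s.%d.pkl' % (prefix, i), 'wb')
            pickle.dump(shard, f, pickle.HIGHEST_PROTOCOL)
            f.close()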
Re: splitting a large dictionary into smaller ones
On Mar 22, 10:51 pm, Paul Rubin <http://phr...@nospam.invalid> wrote: > per writes: > > i would like to split the dictionary into smaller ones, containing > > only hundreds of thousands of keys, and then try to pickle them. > > That already sounds like the wrong approach. You want a database. fair enough - what native python database would you recommend? i prefer not to install anything commercial or anything other than python modules -- http://mail.python.org/mailman/listinfo/python-list
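For a pure-Python, no-install option, shelve in the standard library is probably the closest fit to "a dict, but on disk": values are pickled into a dbm file and only the entries you touch get loaded. A quick sketch:

import shelve

db = shelve.open('mydata.db')     # a file on disk, no server involved
db['some_key'] = [1, 2, 3]        # values can be any picklable object
print(db['some_key'])
db.close()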
generating random tuples in python
hi all, i am generating a list of random tuples of numbers between 0 and 1 using the rand() function, as follows: for i in range(0, n): rand_tuple = (rand(), rand(), rand()) mylist.append(rand_tuple) when i generate this list, some of the random tuples might be very close to each other, numerically. for example, i might get: (0.553, 0.542, 0.654) and (0.581, 0.491, 0.634) so the two tuples are close to each other in that all of their numbers have similar magnitudes. how can i maximize the amount of "numeric distance" between the elements of this list, but still make sure that all the tuples have numbers between 0 and 1 (inclusive)? in other words i want each tuple to be generated randomly (which is why i am using rand()), but to be as different from the other tuples in the list as possible. thank you for your help -- http://mail.python.org/mailman/listinfo/python-list
Re: generating random tuples in python
On Apr 20, 11:08 pm, Steven D'Aprano wrote: > On Mon, 20 Apr 2009 11:39:35 -0700, per wrote: > > hi all, > > > i am generating a list of random tuples of numbers between 0 and 1 using > > the rand() function, as follows: > > > for i in range(0, n): > > rand_tuple = (rand(), rand(), rand()) mylist.append(rand_tuple) > > > when i generate this list, some of the random tuples might be very close > > to each other, numerically. for example, i might get: > [...] > > how can i maximize the amount of "numeric distance" between the elements > > of > > this list, but still make sure that all the tuples have numbers strictly > > between 0 and 1 (inclusive)? > > Well, the only way to *maximise* the distance between the elements is to > set them to (0.0, 0.5, 1.0). > > > in other words i want the list of random numbers to be arbitrarily > > different (which is why i am using rand()) but as different from other > > tuples in the list as possible. > > That means that the numbers you are generating will no longer be > uniformly distributed, they will be biased. That's okay, but you need to > describe *how* you want them biased. What precisely do you mean by > "maximizing the distance"? > > For example, here's one strategy: you need three random numbers, so > divide the complete range 0-1 into three: generate three random numbers > between 0 and 1/3.0, called x, y, z, and return [x, 1/3.0 + y, 2/3.0 + z]. > > You might even decide to shuffle the list before returning them. > > But note that you might still happen to get (say) [0.332, 0.334, 0.668] > or similar. That's the thing with randomness. > > -- > Steven i realize my example in the original post was misleading. i dont want to maximize the difference between individual members of a single tuple -- i want to maximize the difference between distinct tuples. in other words, it's ok to have (.332, .334, .38), as long as the other tuple is, say, (.52, .6, .9) which is very difference from (.332, . 334, .38). i want the member of a given tuple to be arbitrary, e.g. something like (rand(), rand(), rand()) but that the tuples be very different from each other. to be more formal by very different, i would be happy if they were maximally distant in ordinary euclidean space... so if you just plot the 3-tuples on x, y, z i want them to all be very different from each other. i realize this is obviously biased and that the tuples are not uniformly distributed -- that's exactly what i want... any ideas on how to go about this? thank you. -- http://mail.python.org/mailman/listinfo/python-list
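One common heuristic for this (not a true maximization, just a greedy spread) is best-candidate sampling: for each new tuple, draw several random candidates and keep the one farthest from its nearest already-chosen neighbour. A sketch:

import random

def spread_points(n, dim=3, candidates=50):
    points = [tuple(random.random() for _ in range(dim))]
    for _ in range(n - 1):
        best, best_dist = None, -1.0
        for _ in range(candidates):
            cand = tuple(random.random() for _ in range(dim))
            # squared distance to the nearest point chosen so far
            d = min(sum((a - b) ** 2 for a, b in zip(cand, p)) for p in points)
            if d > best_dist:
                best, best_dist = cand, d
        points.append(best)
    return points

print(spread_points(5))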
Is there such an idiom?
http://jaynes.colorado.edu/PythonIdioms.html """Use dictionaries for searching, not lists. To find items in common between two lists, make the first into a dictionary and then look for items in the second in it. Searching a list for an item is linear-time, while searching a dict for an item is constant time. This can often let you reduce search time from quadratic to linear.""" Is this correct? s = [1,2,3,4,5,...] t = [4,5,6,8,...] how do i find whether there are common item(s) between two lists in linear time? how do i find the number of common items between two lists in linear time? -- http://mail.python.org/mailman/listinfo/python-list
Re: Is there such an idiom?
Thanks Ron, surely set is the simplest way to understand the question, to see whether there is a non-empty intersection. But I did the following thing in a silly way, and I'm still not sure whether it is going to be linear time. def foo(): l = [...] s = [...] dic = {} for i in l: dic[i] = 0 k = 0 while k < len(s): if s[k] in dic: dic[s[k]] += 1 k += 1 -- http://mail.python.org/mailman/listinfo/python-list
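For reference, the set-based version of both questions; building each set is a single pass and membership tests are constant time on average, so the whole thing is expected linear time (note it counts distinct common items):

s = [1, 2, 3, 4, 5]
t = [4, 5, 6, 8]

common = set(s) & set(t)     # set intersection
print(bool(common))          # is there at least one common item?
print(len(common))           # how many distinct items are shared?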
creating pipelines in python
hi all, i am looking for a python package to make it easier to create a "pipeline" of scripts (all in python). what i do right now is have a set of scripts that produce certain files as output, and i simply have a "master" script that checks at each stage whether the output of the previous script exists, using functions from the os module. this has several flaws and i am sure someone has thought of nice abstractions for making these kind of wrappers easier to write. does anyone have any recommendations for python packages that can do this? thanks. -- http://mail.python.org/mailman/listinfo/python-list
Re: creating pipelines in python
Thanks to all for your replies. i want to clarify what i mean by a pipeline. a major feature i am looking for is the ability to chain functions or scripts together, where the output of one script -- which is usually a file -- is required for another script to run. so one script has to wait for the other. i would like to do this over a cluster, where some of the scripts are distributed as separate jobs on a cluster but the results are then collected together. so the ideal library would have easily facilities for expressing this things: script X and Y run independently, but script Z depends on the output of X and Y (which is such and such file or file flag). is there a way to do this? i prefer not to use a framework that requires control of the clusters etc. like Disco, but something that's light weight and simple. right now ruffus seems most relevant but i am not sure -- are there other candidates? thank you. On Nov 23, 4:02 am, Paul Rudin wrote: > per writes: > > hi all, > > > i am looking for a python package to make it easier to create a > > "pipeline" of scripts (all in python). what i do right now is have a > > set of scripts that produce certain files as output, and i simply have > > a "master" script that checks at each stage whether the output of the > > previous script exists, using functions from the os module. this has > > several flaws and i am sure someone has thought of nice abstractions > > for making these kind of wrappers easier to write. > > > does anyone have any recommendations for python packages that can do > > this? > > Not entirely what you're looking for, but the subprocess module is > easier to work with for this sort of thing than os. See e.g. > <http://docs.python.org/library/subprocess.html#replacing-shell-pipeline> -- http://mail.python.org/mailman/listinfo/python-list
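Whatever framework ends up doing the cluster side, the core "make"-style dependency check is small enough to sketch directly; run_step, the make_* functions and the file names below are placeholders:

import os

def out_of_date(target, sources):
    # True if the target file is missing or older than any of its inputs
    if not os.path.exists(target):
        return True
    t = os.path.getmtime(target)
    return any(os.path.getmtime(src) > t for src in sources)

def run_step(func, target, sources=()):
    if out_of_date(target, sources):
        func()                        # func is expected to write `target`

def make_x(): open('x.out', 'w').write('x\n')    # placeholder jobs
def make_y(): open('y.out', 'w').write('y\n')
def make_z(): open('z.out', 'w').write(open('x.out').read() + open('y.out').read())

run_step(make_x, 'x.out')
run_step(make_y, 'y.out')
run_step(make_z, 'z.out', ['x.out', 'y.out'])    # Z waits on the outputs of X and Y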
fastest native python database?
hi all, i'm looking for a native python package to run a very simple database. i was originally using cpickle with dictionaries for my problem, but i was making dictionaries out of very large text files (around 1000MB in size) and pickling was simply too slow. i am not looking for fancy SQL operations, just very simple database operations (doesn't have to be SQL style) and my preference is for a module that just needs python and doesn't require me to run a separate database like Sybase or MySQL. does anyone have any recommendations? the only candidates i've seen are SnakeSQL and buzhug... any thoughts/benchmarks on these? any info on this would be greatly appreciated. thank you -- http://mail.python.org/mailman/listinfo/python-list
Re: fastest native python database?
i would like to add to my previous post that if an option like SQLite with a python interface (pysqlite) would be orders of magnitude faster than naive python options, i'd prefer that. but if that's not the case, a pure python solution without dependencies on other things would be the best option. thanks for the suggestion, will look into gadfly in the meantime. On Jun 17, 11:38 pm, Emile van Sebille wrote: > On 6/17/2009 8:28 PM per said... > > > hi all, > > > i'm looking for a native python package to run a very simple data > > base. i was originally using cpickle with dictionaries for my problem, > > but i was making dictionaries out of very large text files (around > > 1000MB in size) and pickling was simply too slow. > > > i am not looking for fancy SQL operations, just very simple data base > > operations (doesn't have to be SQL style) and my preference is for a > > module that just needs python and doesn't require me to run a separate > > data base like Sybase or MySQL. > > You might like gadfly... > > http://gadfly.sourceforge.net/gadfly.html > > Emile > > > > > does anyone have any recommendations? the only candidates i've seen > > are snaklesql and buzhug... any thoughts/benchmarks on these? > > > any info on this would be greatly appreciated. thank you > > -- http://mail.python.org/mailman/listinfo/python-list
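For scale comparison, sqlite3 ships with Python 2.5 and later and needs no separate server; whether it beats pickling depends entirely on the access pattern (many small lookups favour it, one big sequential load may not). A sketch:

import sqlite3                         # bundled with Python 2.5+

conn = sqlite3.connect('mydata.db')    # just a single file on disk
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')
conn.executemany('INSERT OR REPLACE INTO kv VALUES (?, ?)',
                 [('spam', '1'), ('eggs', '2')])
conn.commit()

for row in conn.execute('SELECT value FROM kv WHERE key = ?', ('spam',)):
    print(row[0])
conn.close()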
allowing output of code that is unittested?
hi all, i am using the standard unittest module to unit test my code. my code contains several print statements which i noticed are suppressed when i call my unit tests using: if __name__ == '__main__': suite = unittest.TestLoader().loadTestsFromTestCase(TestMyCode) unittest.TextTestRunner(verbosity=2).run(suite) is there a way to allow all the print statements in the code that is being run by the unit test functions to be printed to stdout? i want to be able to see the output of the tested code, in addition to the output of the unit testing framework. thank you. -- http://mail.python.org/mailman/listinfo/python-list
efficiently splitting up strings based on substrings
I'm trying to efficiently "split" strings based on what substrings they are made up of. i have a set of strings that are comprised of known substrings. For example, a, b, and c are substrings that are not identical to each other, e.g.: a = "0" * 5 b = "1" * 5 c = "2" * 5 Then my_string might be: my_string = a + b + c i am looking for an efficient way to solve the following problem. suppose i have a short string x that is a substring of my_string. I want to "split" the string x into blocks based on which substrings (i.e. a, b, or c) the chunks of x fall into. to illustrate this, suppose x = "00111". Then I can detect where x starts in my_string using my_string.find(x). But I don't know how to partition x into blocks depending on the substrings. What I want to get out in this case is: "00", "111". If x were "00122", I'd want to get out "00", "1", "22". is there an easy way to do this? i can't simply split x on a, b, or c because these might not be contained in x. I want to avoid doing something inefficient like looking at all substrings of my_string etc. i wouldn't mind using regular expressions for this but i cannot think of an easy regular expression for this problem. I looked at the string module in the library but did not see anything that seemed related, but i might have missed it. any help on this would be greatly appreciated. thanks. -- http://mail.python.org/mailman/listinfo/python-list
Re: efficiently splitting up strings based on substrings
On Sep 5, 6:42 pm, "Rhodri James" wrote: > On Sat, 05 Sep 2009 22:54:41 +0100, per wrote: > > I'm trying to efficiently "split" strings based on what substrings > > they are made up of. > > i have a set of strings that are comprised of known substrings. > > For example, a, b, and c are substrings that are not identical to each > > other, e.g.: > > a = "0" * 5 > > b = "1" * 5 > > c = "2" * 5 > > > Then my_string might be: > > > my_string = a + b + c > > > i am looking for an efficient way to solve the following problem. > > suppose i have a short > > string x that is a substring of my_string. I want to "split" the > > string x into blocks based on > > what substrings (i.e. a, b, or c) chunks of s fall into. > > > to illustrate this, suppose x = "00111". Then I can detect where x > > starts in my_string > > using my_string.find(x). But I don't know how to partition x into > > blocks depending > > on the substrings. What I want to get out in this case is: "00", > > "111". If x were "00122", > > I'd want to get out "00","1", "22". > > > is there an easy way to do this? i can't simply split x on a, b, or c > > because these might > > not be contained in x. I want to avoid doing something inefficient > > like looking at all substrings > > of my_string etc. > > > i wouldn't mind using regular expressions for this but i cannot think > > of an easy regular > > expression for this problem. I looked at the string module in the > > library but did not see > > anything that seemd related but i might have missed it. > > I'm not sure I understand your question exactly. You seem to imply > that the order of the substrings of x is consistent. If that's the > case, this ought to help: > > >>> import re > >>> x = "00122" > >>> m = re.match(r"(0*)(1*)(2*)", x) > >>> m.groups() > > ('00', '1', '22')>>> y = "00111" > >>> m = re.match(r"(0*)(1*)(2*)", y) > >>> m.groups() > > ('00', '111', '') > > You'll have to filter out the empty groups for yourself, but that's > no great problem. > > -- > Rhodri James *-* Wildebeest Herder to the Masses The order of the substrings is consistent but what if it's not 0, 1, 2 but a more complicated string? e.g. a = 1030405, b = 1babcf, c = fUUIUP then the substring x might be 4051ba, in which case using a regexp with (1*) will not work since both a and b substrings begin with the character 1. your solution works if that weren't a possibility, so what you wrote is definitely the kind of solution i am looking for. i am just not sure how to solve it in the general case where the substrings might be similar to each other (but not similar enough that you can't tell where the substring came from). -- http://mail.python.org/mailman/listinfo/python-list
Re: efficiently splitting up strings based on substrings
On Sep 5, 7:07 pm, "Rhodri James" wrote: > On Sat, 05 Sep 2009 23:54:08 +0100, per wrote: > > On Sep 5, 6:42 pm, "Rhodri James" wrote: > >> On Sat, 05 Sep 2009 22:54:41 +0100, per wrote: > >> > I'm trying to efficiently "split" strings based on what substrings > >> > they are made up of. > >> > i have a set of strings that are comprised of known substrings. > >> > For example, a, b, and c are substrings that are not identical to each > >> > other, e.g.: > >> > a = "0" * 5 > >> > b = "1" * 5 > >> > c = "2" * 5 > > >> > Then my_string might be: > > >> > my_string = a + b + c > > >> > i am looking for an efficient way to solve the following problem. > >> > suppose i have a short > >> > string x that is a substring of my_string. I want to "split" the > >> > string x into blocks based on > >> > what substrings (i.e. a, b, or c) chunks of s fall into. > > >> > to illustrate this, suppose x = "00111". Then I can detect where x > >> > starts in my_string > >> > using my_string.find(x). But I don't know how to partition x into > >> > blocks depending > >> > on the substrings. What I want to get out in this case is: "00", > >> > "111". If x were "00122", > >> > I'd want to get out "00","1", "22". > > >> > is there an easy way to do this? i can't simply split x on a, b, or c > >> > because these might > >> > not be contained in x. I want to avoid doing something inefficient > >> > like looking at all substrings > >> > of my_string etc. > > >> > i wouldn't mind using regular expressions for this but i cannot think > >> > of an easy regular > >> > expression for this problem. I looked at the string module in the > >> > library but did not see > >> > anything that seemd related but i might have missed it. > > >> I'm not sure I understand your question exactly. You seem to imply > >> that the order of the substrings of x is consistent. If that's the > >> case, this ought to help: > > >> >>> import re > >> >>> x = "00122" > >> >>> m = re.match(r"(0*)(1*)(2*)", x) > >> >>> m.groups() > > >> ('00', '1', '22')>>> y = "00111" > >> >>> m = re.match(r"(0*)(1*)(2*)", y) > >> >>> m.groups() > > >> ('00', '111', '') > > >> You'll have to filter out the empty groups for yourself, but that's > >> no great problem. > > > The order of the substrings is consistent but what if it's not 0, 1, 2 > > but a more complicated string? e.g. > > > a = 1030405, b = 1babcf, c = fUUIUP > > > then the substring x might be 4051ba, in which case using a regexp > > with (1*) will not work since both a and b substrings begin with the > > character 1. > > Right. This looks approximately nothing like what I thought your > problem was. Would I be right in thinking that you want to match > substrings of your potential "substrings" against the string x? > > I'm sufficiently confused that I think I'd like to see what your > use case actually is before I make more of a fool of myself. > > -- > Rhodri James *-* Wildebeest Herder to the Masses it's exactly the same problem, except there are no constraints on the strings. so the problem is, like you say, matching the substrings against the string x. in other words, finding out where x "aligns" to the ordered substrings abc, and then determine what chunk of x belongs to a, what chunk belongs to b, and what chunk belongs to c. so in the example i gave above, the substrings are: a = 1030405, b = 1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP given a substring like 4051ba, i'd want to split it into the chunks a, b, and c. in this case, i'd want the result to be: ["405", "1ba"] -- i.e. 
"405" is the chunk of x that belongs to a, and "1ba" the chunk that belongs to be. in this case, there are no chunks of c. if x instead were "4051babcffUU", the right output is: ["405", "1babcf", "fUU"], which are the corresponding chunks of a, b, and c that make up x respectively. i'm not sure how to approach this. any ideas/tips would be greatly appreciated. thanks again. -- http://mail.python.org/mailman/listinfo/python-list
serial port server cnhd38
To whom it may concern, The serial port server 'cnhd38' has been terminated (on whose initiative, I don't know). It affects the users of the (at least) following nodes: cnhd36, cnhd44, cnhd45, cnhd46, cnhd47. The new terminal server to use is called 'msp-t01'. The port numbers that are of interest for the nodes mentioned above are as follows: port 17: this port is shared between: cnhd44/etm4 serial port (via riscwatch), currently connected here. cnhd36/console port port 18: this port goes to cnhd44/console port port 19: this port goes to cnhd45/console port port 20: this port goes to cnhd47/console port port 21: this port goes to cnhd46/console port To connect to a port, just enter the following command: telnet msp-t01 <telnet port> ... an extra Enter should give you the prompt. The telnet port is always 20 followed by the port number... example, connect to cnhd47/console port: telnet msp-t01 2020 br /Per -- http://mail.python.org/mailman/listinfo/python-list
Re: Program eating memory, but only on one machine?
Wolfgang Draxinger darkstargames.de> writes: > > > So, does anyone have any suggestions for how I can debug this > > problem? > > Have a look at the version numbers of the GCC used. Probably > something in your C code fails if it interacts with GCC 3.x.x. > It's hardly Python eating memory, this is probably your C > module. GC won't help here, since then you must add this into > your C module. > > > If my program ate up memory on all machines, then I would know > > where to start and would blame some horrible programming on my > > end. This just seems like a less straightforward problem. > > GCC 3.x.x brings other runtime libs, than GCC 4.x.x, I would > check into that direction. > Thank you for the suggestions. Since my C module is such a small part of the simulations, I can just comment out the call to that module completely (though I am still loading it) and fill in what the results would have been with random values. Sadly, the program still eats up memory on our cluster. Still, it could be something related to compiling Python with the older GCC. I'll see if I can make a really small example program that eats up memory on our cluster. That way we'll have something easy to work with. Thanks, Per -- http://mail.python.org/mailman/listinfo/python-list
Re: Program eating memory, but only on one machine? (Solved, sort of)
Per B.Sederberg princeton.edu> writes: > I'll see if I can make a really small example program that eats up memory on > our cluster. That way we'll have something easy to work with. Now this is weird. I figured out the bug and it turned out that every time you call numpy.setmember1d in the latest stable release of numpy it was using up a ton of memory and never releasing it. I replaced every instance of setmember1d with my own method below and I have zero increase in memory. It's not the most efficient of code, but it gets the job done...

def ismember(a, b):
    # a is a numpy array, b an iterable of values to look for;
    # assumes zeros has been imported from numpy
    ainb = zeros(len(a), dtype=bool)
    for item in b:
        ainb = ainb | (a == item)    # mark positions of a equal to this item
    return ainb

I'll now go post this problem on the numpy forums. Best, Per -- http://mail.python.org/mailman/listinfo/python-list
efficient interval containment lookup
hello, suppose I have two lists of intervals, one significantly larger than the other. For example listA = [(10, 30), (5, 25), (100, 200), ...] might contain thousands of elements while listB (of the same form) might contain hundreds of thousands or millions of elements. I want to count how many intervals in listB are contained within every interval in listA. For example, if listA = [(10, 30), (600, 800)] and listB = [(20, 25), (12, 18)] is the input, then the output should be that (10, 30) has 2 intervals from listB contained within it, while (600, 800) has 0. (Elements of listB can be contained within many intervals in listA, not just one.) What is an efficient way to do this? One simple way is: for a_range in listA: for b_range in listB: if is_within(b_range, a_range): # accumulate a counter here where is_within simply checks if the first argument is within the second. I'm not sure if it's more efficient to have the iteration over listA be on the outside or listB. But perhaps there's a way to index this that makes things more efficient? I.e. a smart way of indexing listA such that I can instantly get all of its elements that contain some element of listB, maybe? Something like a hash, where this lookup can be close to constant time rather than an iteration over all lists... if there are any built-in library functions that can help with this it would be great. any suggestions on this would be awesome. thank you. -- http://mail.python.org/mailman/listinfo/python-list
Re: efficient interval containment lookup
thanks for your replies -- a few clarifications and questions. the is_within operation is containment, i.e. (a,b) is within (c,d) iff a >= c and b <= d. Note that I am not looking for intervals that overlap... this is why interval trees seem to me to not be relevant, as the overlapping interval problem is way harder than what I am trying to do. Please correct me if I'm wrong on this... Scott Daniels, I was hoping you could elaborate on your comment about bisect. I am trying to use it as follows: I try to grid my space (since my intervals have an upper and lower bound) into segments (e.g. of 100) and then I take these "bins" and put them into a bisect list, so that it is sorted. Then when a new interval comes in, I try to place it within one of those bins. But this is getting messy: I don't know if I should place it there by its beginning number or end number. Also, if I have an interval that overlaps my boundaries -- i.e. (900, 1010) when my first interval is (0, 1000), I may miss some items from listB when i make my count. Is there an elegant solution to this? Gridding like you said seemed straight forward but now it seems complicated.. I'd like to add that this is *not* a homework problem, by the way. On Jan 12, 4:05 pm, Robert Kern wrote: > [Apologies for piggybacking, but I think GMane had a hiccup today and missed > the > original post] > > [Somebody wrote]: > > >> suppose I have two lists of intervals, one significantly larger than > >> the other. > >> For example listA = [(10, 30), (5, 25), (100, 200), ...] might contain > >> thousands > >> of elements while listB (of the same form) might contain hundreds of > >> thousands > >> or millions of elements. > >> I want to count how many intervals in listB are contained within every > >> listA. For example, if listA = [(10, 30), (600, 800)] and listB = > >> [(20, 25), (12, 18)] is the input, then the output should be that (10, > >> 30) has 2 intervals from listB contained within it, while (600, 800) > >> has 0. (Elements of listB can be contained within many intervals in > >> listA, not just one.) > > Interval trees. > > http://en.wikipedia.org/wiki/Interval_tree > > -- > Robert Kern > > "I have come to believe that the whole world is an enigma, a harmless enigma > that is made terrible by our own mad attempt to interpret it as though it > had > an underlying truth." > -- Umberto Eco -- http://mail.python.org/mailman/listinfo/python-list
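A middle ground between a full interval tree and brute force is to sort listB by start point once and use bisect to skip everything that starts before the query interval; a sketch (worst case it still scans every interval that merely starts inside the query range, but for containment counting that is often good enough):

import bisect

def count_contained(listA, listB):
    listB_sorted = sorted(listB)                 # sorted by start, then stop
    starts = [s for s, e in listB_sorted]
    counts = []
    for a, b in listA:
        i = bisect.bisect_left(starts, a)        # first interval starting at or after a
        n = 0
        while i < len(listB_sorted) and starts[i] <= b:
            if listB_sorted[i][1] <= b:          # fully contained in (a, b)
                n += 1
            i += 1
        counts.append(n)
    return counts

print(count_contained([(10, 30), (600, 800)], [(20, 25), (12, 18)]))   # [2, 0]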
Re: efficient interval containment lookup
On Jan 12, 10:58 pm, Steven D'Aprano wrote: > On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote: > > thanks for your replies -- a few clarifications and questions. the > > is_within operation is containment, i.e. (a,b) is within (c,d) iff a > >>= c and b <= d. Note that I am not looking for intervals that > > overlap... this is why interval trees seem to me to not be relevant, as > > the overlapping interval problem is way harder than what I am trying to > > do. Please correct me if I'm wrong on this... > > To test for contained intervals: > a >= c and b <= d > > To test for overlapping intervals: > > not (b < c or a > d) > > Not exactly what I would call "way harder". > > -- > Steven hi Steven, i found an implementation (which is exactly how i'd write it based on the description) here: http://hackmap.blogspot.com/2008/11/python-interval-tree.html when i use this however, it comes out either significantly slower or equal to a naive search. my naive search just iterates through a smallish list of intervals and for each one says whether they overlap with each of a large set of intervals. here is the exact code i used to make the comparison, plus the code at the link i have above: class Interval(): def __init__(self, start, stop): self.start = start self.stop = stop import random import time num_ints = 3 init_intervals = [] for n in range(0, num_ints): start = int(round(random.random() *1000)) end = start + int(round(random.random()*500+1)) init_intervals.append(Interval(start, end)) num_ranges = 900 ranges = [] for n in range(0, num_ranges): start = int(round(random.random() *1000)) end = start + int(round(random.random()*500+1)) ranges.append((start, end)) #print init_intervals tree = IntervalTree(init_intervals) t1 = time.time() for r in ranges: tree.find(r[0], r[1]) t2 = time.time() print "interval tree: %.3f" %((t2-t1)*1000.0) t1 = time.time() for r in ranges: naive_find(init_intervals, r[0], r[1]) t2 = time.time() print "brute force: %.3f" %((t2-t1)*1000.0) on one run, i get: interval tree: 8584.682 brute force: 8201.644 is there anything wrong with this implementation? it seems very right to me but i am no expert. any help on this would be relly helpful. -- http://mail.python.org/mailman/listinfo/python-list
Re: efficient interval containment lookup
i forgot to add, my naive_find is: def naive_find(intervals, start, stop): results = [] for interval in intervals: if interval.start >= start and interval.stop <= stop: results.append(interval) return results On Jan 12, 11:55 pm, Per Freem wrote: > On Jan 12, 10:58 pm, Steven D'Aprano > > > > wrote: > > On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote: > > > thanks for your replies -- a few clarifications and questions. the > > > is_within operation is containment, i.e. (a,b) is within (c,d) iff a > > >>= c and b <= d. Note that I am not looking for intervals that > > > overlap... this is why interval trees seem to me to not be relevant, as > > > the overlapping interval problem is way harder than what I am trying to > > > do. Please correct me if I'm wrong on this... > > > To test for contained intervals: > > a >= c and b <= d > > > To test for overlapping intervals: > > > not (b < c or a > d) > > > Not exactly what I would call "way harder". > > > -- > > Steven > > hi Steven, > > i found an implementation (which is exactly how i'd write it based on > the description) > here:http://hackmap.blogspot.com/2008/11/python-interval-tree.html > > when i use this however, it comes out either significantly slower or > equal to a naive search. my naive search just iterates through a > smallish list of intervals and for each one says whether they overlap > with each of a large set of intervals. > > here is the exact code i used to make the comparison, plus the code at > the link i have above: > > class Interval(): > def __init__(self, start, stop): > self.start = start > self.stop = stop > > import random > import time > num_ints = 3 > init_intervals = [] > for n in range(0, > num_ints): > start = int(round(random.random() > *1000)) > end = start + int(round(random.random()*500+1)) > init_intervals.append(Interval(start, end)) > num_ranges = 900 > ranges = [] > for n in range(0, num_ranges): > start = int(round(random.random() > *1000)) > end = start + int(round(random.random()*500+1)) > ranges.append((start, end)) > #print init_intervals > tree = IntervalTree(init_intervals) > t1 = time.time() > for r in ranges: > tree.find(r[0], r[1]) > t2 = time.time() > print "interval tree: %.3f" %((t2-t1)*1000.0) > t1 = time.time() > for r in ranges: > naive_find(init_intervals, r[0], r[1]) > t2 = time.time() > print "brute force: %.3f" %((t2-t1)*1000.0) > > on one run, i get: > interval tree: 8584.682 > brute force: 8201.644 > > is there anything wrong with this implementation? it seems very right > to me but i am no expert. any help on this would be relly helpful. -- http://mail.python.org/mailman/listinfo/python-list
Re: efficient interval containment lookup
hi brent, thanks very much for your informative reply -- didn't realize this about the size of the interval. thanks for the bx-python link. could you (or someone else) explain why the size of the interval makes such a big difference? i don't understand why it affects efficiency so much... thanks. On Jan 13, 12:24 am, brent wrote: > On Jan 12, 8:55 pm, Per Freem wrote: > > > > > On Jan 12, 10:58 pm, Steven D'Aprano > > > wrote: > > > On Mon, 12 Jan 2009 14:49:43 -0800, Per Freem wrote: > > > > thanks for your replies -- a few clarifications and questions. the > > > > is_within operation is containment, i.e. (a,b) is within (c,d) iff a > > > >>= c and b <= d. Note that I am not looking for intervals that > > > > overlap... this is why interval trees seem to me to not be relevant, as > > > > the overlapping interval problem is way harder than what I am trying to > > > > do. Please correct me if I'm wrong on this... > > > > To test for contained intervals: > > > a >= c and b <= d > > > > To test for overlapping intervals: > > > > not (b < c or a > d) > > > > Not exactly what I would call "way harder". > > > > -- > > > Steven > > > hi Steven, > > > i found an implementation (which is exactly how i'd write it based on > > the description) > > here:http://hackmap.blogspot.com/2008/11/python-interval-tree.html > > > when i use this however, it comes out either significantly slower or > > equal to a naive search. my naive search just iterates through a > > smallish list of intervals and for each one says whether they overlap > > with each of a large set of intervals. > > > here is the exact code i used to make the comparison, plus the code at > > the link i have above: > > > class Interval(): > > def __init__(self, start, stop): > > self.start = start > > self.stop = stop > > > import random > > import time > > num_ints = 3 > > init_intervals = [] > > for n in range(0, > > num_ints): > > start = int(round(random.random() > > *1000)) > > end = start + int(round(random.random()*500+1)) > > init_intervals.append(Interval(start, end)) > > num_ranges = 900 > > ranges = [] > > for n in range(0, num_ranges): > > start = int(round(random.random() > > *1000)) > > end = start + int(round(random.random()*500+1)) > > ranges.append((start, end)) > > #print init_intervals > > tree = IntervalTree(init_intervals) > > t1 = time.time() > > for r in ranges: > > tree.find(r[0], r[1]) > > t2 = time.time() > > print "interval tree: %.3f" %((t2-t1)*1000.0) > > t1 = time.time() > > for r in ranges: > > naive_find(init_intervals, r[0], r[1]) > > t2 = time.time() > > print "brute force: %.3f" %((t2-t1)*1000.0) > > > on one run, i get: > > interval tree: 8584.682 > > brute force: 8201.644 > > > is there anything wrong with this implementation? it seems very right > > to me but i am no expert. any help on this would be relly helpful. > > hi, the tree is inefficient when the interval is large. as the size of > the interval shrinks to much less than the expanse of the tree, the > tree will be faster. changing 500 to 50 in both cases in your script, > i get: > interval tree: 3233.404 > brute force: 9807.787 > > so the tree will work for limited cases. but it's quite simple. check > the tree in > bx-python:http://bx-python.trac.bx.psu.edu/browser/trunk/lib/bx/intervals/opera... > for a more robust implementation. > -brentp -- http://mail.python.org/mailman/listinfo/python-list
optimizing large dictionaries
hello i have an optimization questions about python. i am iterating through a file and counting the number of repeated elements. the file has on the order of tens of millions elements... i create a dictionary that maps elements of the file that i want to count to their number of occurs. so i iterate through the file and for each line extract the elements (simple text operation) and see if it has an entry in the dict: for line in file: try: elt = MyClass(line)# extract elt from line... my_dict[elt] += 1 except KeyError: my_dict[elt] = 1 i am using try/except since it is supposedly faster (though i am not sure about this? is this really true in Python 2.5?). the only 'twist' is that my elt is an instance of a class (MyClass) with 3 fields, all numeric. the class is hashable, and so my_dict[elt] works well. the __repr__ and __hash__ methods of my class simply return str() representation of self, while __str__ just makes everything numeric field into a concatenated string: class MyClass def __str__(self): return "%s-%s-%s" %(self.field1, self.field2, self.field3) def __repr__(self): return str(self) def __hash__(self): return hash(str(self)) is there anything that can be done to speed up this simply code? right now it is taking well over 15 minutes to process, on a 3 Ghz machine with lots of RAM (though this is all taking CPU power, not RAM at this point.) any general advice on how to optimize large dicts would be great too thanks for your help. -- http://mail.python.org/mailman/listinfo/python-list
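One thing worth timing before anything exotic: drop the class and the string-based __hash__ entirely and key the dict on a plain tuple, with collections.defaultdict handling missing keys. Tuples are hashed in C, so the per-lookup str() round-trip disappears. A sketch (the split('\t')[:3] line is a stand-in for however the three fields are really extracted):

from collections import defaultdict

counts = defaultdict(int)                  # missing keys start at 0
f = open('data.txt')                       # placeholder path
for line in f:
    f1, f2, f3 = line.rstrip('\n').split('\t')[:3]   # assumed field extraction
    counts[(f1, f2, f3)] += 1              # plain tuple key, no __str__/__hash__ detour
f.close()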
Re: optimizing large dictionaries
thanks to everyone for the excellent suggestions. a few follow up q's: 1] is Try-Except really slower? my dict actually has two layers, so my_dict[aKey][bKeys]. the aKeys are very small (less than 100) where as the bKeys are the ones that are in the millions. so in that case, doing a Try-Except on aKey should be very efficient, since often it will not fail, where as if I do: "if aKey in my_dict", that statement will get executed for each aKey. can someone definitely say whether Try-Except is faster or not? My benchmarks aren't conclusive and i hear it both ways from several people (though majority thinks TryExcept is faster). 2] is there an easy way to have nested defaultdicts? ie i want to say that my_dict = defaultdict(defaultdict(int)) -- to reflect the fact that my_dict is a dictionary, whose values are dictionary that map to ints. but that syntax is not valid. 3] more importantly, is there likely to be a big improvement for splitting up one big dictionary into several smaller ones? if so, is there a straight forward elegant way to implement this? the way i am thinking is to just fix a number of dicts and populate them with elements. then during retrieval, try the first dict, if that fails, try the second, if not the third, etc... but i can imagine how that's more likely to lead to bugs / debugging give the way my code is setup so i am wondering whether it is really worth it. if it can lead to a factor of 2 difference, i will definitely implement it -- does anyone have experience with this? On Jan 15, 5:58 pm, Steven D'Aprano wrote: > On Thu, 15 Jan 2009 23:22:48 +0100, Christian Heimes wrote: > >> is there anything that can be done to speed up this simply code? right > >> now it is taking well over 15 minutes to process, on a 3 Ghz machine > >> with lots of RAM (though this is all taking CPU power, not RAM at this > >> point.) > > > class MyClass(object): > > # a new style class with slots saves some memory > > __slots__ = ("field1", "field2", "field2") > > I was curious whether using slots would speed up attribute access. > > >>> class Parrot(object): > > ... def __init__(self, a, b, c): > ... self.a = a > ... self.b = b > ... self.c = c > ...>>> class SlottedParrot(object): > > ... __slots__ = 'a', 'b', 'c' > ... def __init__(self, a, b, c): > ... self.a = a > ... self.b = b > ... self.c = c > ... > > >>> p = Parrot(23, "something", [1, 2, 3]) > >>> sp = SlottedParrot(23, "something", [1, 2, 3]) > > >>> from timeit import Timer > >>> setup = "from __main__ import p, sp" > >>> t1 = Timer('p.a, p.b, p.c', setup) > >>> t2 = Timer('sp.a, sp.b, sp.c', setup) > >>> min(t1.repeat()) > 0.83308887481689453 > >>> min(t2.repeat()) > > 0.62758088111877441 > > That's not a bad improvement. I knew that __slots__ was designed to > reduce memory consumption, but I didn't realise they were faster as well. > > -- > Steven -- http://mail.python.org/mailman/listinfo/python-list
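On question 2: nested defaultdicts just need a callable for the inner level, so wrap it in a lambda (or a named function). A sketch:

from collections import defaultdict

# my_dict[aKey][bKey] += 1 now works with no try/except and no
# membership tests at either level.
my_dict = defaultdict(lambda: defaultdict(int))

my_dict['a_key']['b_key'] += 1
print(my_dict['a_key']['b_key'])    # 1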
Program eating memory, but only on one machine?
Hi Everybody: I'm having a difficult time figuring out a a memory use problem. I have a python program that makes use of numpy and also calls a small C module I wrote because part of the simulation needed to loop and I got a massive speedup by putting that loop in C. I'm basically manipulating a bunch of matrices, so nothing too fancy. That aside, when the simulation runs, it typically uses a relatively small amount of memory (about 1.5% of my 4GB of RAM on my linux desktop) and this never increases. It can run for days without increasing beyond this, running many many parameter set iterations. This is what happens both on my Ubuntu Linux machine with the following Python specs: Python 2.4.4c1 (#2, Oct 11 2006, 20:00:03) [GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import numpy >>> numpy.version.version '1.0rc1' and also on my Apple MacBook with the following Python specs: Python 2.4.3 (#1, Apr 7 2006, 10:54:33) [GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import numpy >>> numpy.version.version '1.0.1.dev3435' >>> Well, that is the case on two of my test machines, but not on the one machine that I really wish would work, my lab's cluster, which would give me 20-fold increase in the number of processes I could run. On that machine, each process is using 2GB of RAM after about 1 hour (and the cluster MOM eventually kills them). I can watch the process eat RAM at each iteration and never relinquish it. Here's the Python spec of the cluster: Python 2.4.4 (#1, Jan 21 2007, 12:09:48) [GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-49)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import numpy >>> numpy.version.version '1.0.1' It also showed the same issue with the April 2006 2.4.3 release of python. I have tried using the gc module to force garbage collection after each iteration, but no change. I've done many newsgroup/google searches looking for known issues, but none found. The only major difference I can see is that our cluster is stuck on a really old version of gcc with the RedHat Enterprise that's on there, but I found no suggestions of memory issues online. So, does anyone have any suggestions for how I can debug this problem? If my program ate up memory on all machines, then I would know where to start and would blame some horrible programming on my end. This just seems like a less straightforward problem. Thanks for any help, Per -- http://mail.python.org/mailman/listinfo/python-list
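One low-tech way to see where the growth happens on the cluster (Unix only; note that ru_maxrss is reported in kilobytes on Linux but in bytes on Mac OS X) is to log the peak resident set size every iteration; the loop body below is just a stand-in for one simulation iteration:

import resource

def peak_rss():
    # peak resident set size of this process so far
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

retained = []
for i in range(5):
    retained.extend(range(100000))    # stand-in for one iteration's work
    print(i, peak_rss())              # climbs steadily if memory is being retained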
RE: listdir reports [Error 1006] The volume for a file has been externally altered so that the opened file is no longer valid
FYI: the '/*.*' is part of the error message returned. -Original Message- From: ch...@rebertia.com [mailto:ch...@rebertia.com] On Behalf Of Chris Rebert Sent: Wednesday, January 07, 2009 6:40 PM To: Per Olav Kroka Cc: python-list@python.org Subject: Re: listdir reports [Error 1006] The volume for a file has been externally altered so that the opened file is no longer valid > PS: Why does the listdir() function add '*.*' to the path? Don't know what you're talking about. It doesn't do any globbing or add "*.*" to the path. Its exclusive purpose is to list the contents of a directory, so /in a sense/ it does add "*.*", but then not adding "*.*" would make the function completely useless given its purpose. > PS2: Why does the listdir() function add '/*.*' to the path on windows > and not '\\*.*' ? You can use either directory separator (\ or /) with the Python APIs on Windows. r"c:\WINDOWS\" works just as well as "c:/WINDOWS/". Cheers, Chris -- Follow the path of the Iguana... http://rebertia.com -- http://mail.python.org/mailman/listinfo/python-list