Re: how to debug python's extend module written in c/c++ on windows
Dear Diez:

It is a matter of attaching a C debugger to Python. I can attach a Python debugger (for example, Wing IDE) to a C debugger (for example, VS2008), but I cannot attach VS2008 to Wing IDE. I need both Python statements and C statements to be single-step debuggable.

best regards
fang
Re: how to debug python's extend module written in c/c++ on windows
Dear Diez:

I see. I really appreciate your help.

best regards
fang
how to cut and paste in the windows of python(command line)
Dear all:

The mouse gets no response in the window of Python (command line), so cut and paste cannot be done there. Ctrl-C and Ctrl-V do not work, but they do work in IDLE. Please teach me how to cut and paste in the Python command line.
Re: how to cut and paste in the windows of python(command line)
Dear: Thank you very much! I can do it! It's very nice!
Re: Python GTK import error
sorry to keep bugging you guys, but this is a weird problem that I am eager to solve. I checked the permissions and they look fine to me:

lrwxrwxrwx  1 root root      26 Nov 28 04:06 /usr/lib/libgtk-x11-2.0.so.0 -> libgtk-x11-2.0.so.0.400.13
-rwxr-xr-x  1 root root 2862900 Oct 15 14:23 /usr/lib/libgtk-x11-2.0.so.0.400.13

Even when I run as root, the error message "ImportError: /usr/lib/libgtk-x11-2.0.so.0: undefined symbol: atk_object_add_relationship" still comes out. I always try to avoid installing packages manually; rpm or apt-get is my preferred way. However, this time it defeated my working principle, because this error message appeared after I upgraded packages using "synaptic", an apt-get GUI. Now I don't even trust apt-get :(

Andrew James <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> Try checking the permissions on the link and the library - if you
> installed the module manually the permissions may not allow anyone but
> the root user to import the module.
>
> Andrew
How to get the type of an object?
I wrote a function with a list as its parameter, and the function has to perform different operations based on the datatypes of the elements. How can I decide whether an object is, say, a list or a string? Thanks.
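A minimal sketch of this kind of type dispatch, using the built-in isinstance() (the function name and sample data are made up):

    def process(items):
        # isinstance also accepts a tuple of types and respects subclasses,
        # which is why it is usually preferred over type(x) == list.
        for item in items:
            if isinstance(item, str):
                print 'string:', item
            elif isinstance(item, list):
                print 'list of length', len(item)
            else:
                print 'something else:', item

    process(['abc', [1, 2, 3], 42])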
Re: How to get the type of an object?
That helps. Thank you guys.
Python and STL efficiency
Hi, I'm learning STL and I wrote some simple code to compare the efficiency of Python and STL.

//C++
#include <iostream>
#include <string>
#include <vector>
#include <set>
#include <iterator>
#include <algorithm>
using namespace std;

int main(){
    vector<string> a;
    // NB: the loop bound was garbled in the list archive; a count in
    // the tens of thousands is consistent with the discussion below.
    for (long int i = 0; i < 10000; ++i){
        a.push_back("What do you know?");
        a.push_back("so long...");
        a.push_back("chicken crosses road");
        a.push_back("fool");
    }
    set<string> b(a.begin(), a.end());
    unique_copy(b.begin(), b.end(), ostream_iterator<string>(cout, "\n"));
}

#python
def f():
    a = []
    for i in range(10000):   # same (reconstructed) loop bound as the C++ version
        a.append('What do you know')
        a.append('so long...')
        a.append('chicken crosses road')
        a.append('fool')
    b = set(a)
    for s in b:
        print s

I was using VC++.net and IDLE, respectively. I had expected C++ to be way faster. However, while the Python code gave the result almost instantly, the C++ code took several seconds to run! Can somebody explain this to me? Or is there something wrong with my code?
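For timing the Python side of such comparisons, the standard library's timeit module avoids common measurement mistakes (a sketch; it assumes the f defined above, and the prints will dominate the measured time):

    import timeit

    # Time a single call to f(); 'setup' imports it from this module.
    t = timeit.Timer('f()', 'from __main__ import f')
    print t.timeit(number=1), 'seconds'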
Re: Python and STL efficiency
Marc 'BlackJack' Rintsch wrote:
> In <[EMAIL PROTECTED]>, Licheng Fang wrote:
>
> > Hi, I'm learning STL and I wrote some simple code to compare the
> > efficiency of python and STL.
> >
> > [snip: the C++ and Python versions quoted from the original post]
>
> Why are you using `unique_copy` here?

Sorry, that's a typo. Actually I used 'copy'.

> > I was using VC++.net and IDLE, respectively. I had expected C++ to be
> > way faster. However, while the python code gave the result almost
> > instantly, the C++ code took several seconds to run! Can somebody
> > explain this to me? Or is there something wrong with my code?
>
> There's a difference in data structures at least. The Python `set` type
> is implemented with a hash algorithm, so the equivalent STL type would
> be `hash_set`. `set` in Python does not store its contents sorted.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Thank you for your comments. I tested with hash_set, but I didn't see much performance improvement. When I increased the loop to 1 million times, the python code still ran reasonably fast and the C++ code got stuck there. This totally surprised me, because according to this page http://norvig.com/python-lisp.html, the speed of python is nowhere near that of C++.
Re: Modules... paths... newbie confusion
MrBlueSky wrote:
> I wonder if someone could clarify how Python "knows" where modules are
> - or at least point to some documentation that might help me? Here's
> what I've been trying:
>
> I've installed Python 2.4 Windows, and have also installed tkinter,
> pmw, cx_Oracle, mssql and pytz (phew!) all under my c:\python24 folder.
>
> But when I try to "import pytz" or "import MSSQL" in a Python shell
> (via IDLE) it's not recognised - yet "import Tkinter", "import Pmw" and
> "import cx_Oracle" all work.
>
> I've experimented with "sys.path" to get the import of pytz to work,
> but without success so far.
>
> I feel as if I'm missing some key piece of information on how this all
> fits together! Please, help!
>
> John

You may have to add the path of the module to the "PYTHONPATH" environment variable to make it work.
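To inspect and extend the module search path for a single session (the site-packages directory shown is illustrative; the right entry is whichever folder actually contains the pytz package):

    import sys
    print sys.path    # the directories searched for imports, in order
    # Make a package importable for this session only:
    sys.path.append(r'C:\python24\Lib\site-packages')
    import pytz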
How to get the "longest possible" match with Python's RE module?
Basically, the problem is this:

>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'

Python's NFA regexp engine tries only the first option, and happily rests on that. Here's another example:

>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'

The Python regular expression engine doesn't exhaust all the possibilities, but in my application I hope to get the longest possible match, starting from a given point. Is there a way to do this in Python?
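One workaround, given that Python's alternation is first-match rather than longest-match: try each alternative separately and keep the longest hit (a sketch; the helper name is made up, and it only emulates longest-match for a plain alternation):

    import re

    def longest_match(alternatives, text, pos=0):
        # Emulate POSIX leftmost-longest semantics for an alternation
        # by testing each branch on its own and keeping the longest.
        best = None
        for alt in alternatives:
            m = re.compile(alt).match(text, pos)
            if m and (best is None or len(m.group()) > len(best.group())):
                best = m
        return best

    print longest_match(["do", "dolittle"], "dolittle").group()   # dolittle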
Re: How to get the "longest possible" match with Python's RE module?
MonkeeSage wrote:
> Licheng Fang wrote:
> > Basically, the problem is this:
> >
> > >>> p = re.compile("do|dolittle")
> > >>> p.match("dolittle").group()
> > 'do'
>
> From what I understand, this isn't python specific, it is the expected
> behavior of that pattern in any implementation. You are using
> alternation, which means "either, or", and you have the shorter
> subexpression first, so the condition is satisfied by just 'do' and the
> matching terminates.
>
> > There's another example:
> >
> > >>> p = re.compile("one(self)?(selfsufficient)?")
> > >>> p.match("oneselfsufficient").group()
> > 'oneself'
>
> Again, I don't think this has anything to do with python. You pattern
> basically means "match 'one' whether it is followed by 'self' or not,
> and whether it is followed by 'selfsufficient' or not". For this
> particular example, you'd want something like
> "one(self)?(sufficient)?".
>
> I think you could construct a pattern that would do what you want in
> python without any problem. If you post a (short) example of your data,
> I'm sure someone could help you with it.
>
> Regards,
> Jordan

Hi, according to these regexp engine discussions, it's NOT a behavior true to every implementation:

http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html

Python's NFA engine reads along the input string, matching it to the pattern, and backtracking when needed. By contrast a DFA engine, to my understanding, constructs a DFA and uses it to munch as many characters as possible. Maybe it's like this:

Pattern: one(self)?(selfsufficient)?

PYTHON'S NFA ENGINE:

          one        self, none     selfsufficient, none
(start)--->((1))-------->((2))--------->((3))

DFA ENGINE:

          one           self
(start)--->((123))------>((23))
                           |
                           | selfsufficient
                           |
                           +--->((3))

I want to know if there is some way to make Python RE behave like grep does, or do I have to change to another engine?
Re: How to get the "longest possible" match with Python's RE module?
Oh, please do have a look at the second link I've posted. There's a table comparing the regexp engines. The engines you've tested probably all use an NFA implementation.

MonkeeSage wrote:
> Licheng Fang wrote:
> > Hi, according to these regexp engine discussions, it's NOT a behavior
> > true to any implementation.
> > [snip]
>
> Well, I just double-checked in ruby (oniguruma regexp engine):
>
> r = Regexp.new("do|dolittle")
> puts r.match("dolittle")[0]
> # do
>
> r = Regexp.new("one(self)?(sufficient)?")
> puts r.match("oneselfsufficient")[0]
> # oneself
>
> And perl:
>
> if ("doolittle" =~ /(do|dolittle)/) {
>     print "$1\n";
>     # do
> }
>
> if ("oneselfsufficient" =~ /(one(self)?(selfsufficient)?)/) {
>     print "$1\n";
>     # oneself
> }
>
> And Javascript (whatever regexp engine Spidermonkey uses):
>
> var r = new RegExp(/do|dolittle/);
> alert("dolittle".match(r)[0]);
>
> var r = new RegExp(/one(self)?(selfsufficient)?/);
> alert("oneselfsufficient".match(r)[0]);
>
> So, it seems they are all broken, or python is correct as well.
>
> Regards,
> Jordan
Re: How to get the "longest possible" match with Python's RE module?
Thank you very much, Tim and Monkee.

In fact, what I'm doing is handling a lot of regular expressions. I wanted to build VERY LONG regexps part by part and put them all into a file for easy modification and maintenance. The idea is like this:

(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FRACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FRACTION)|(*DECIMAL)|(*INT)
... ...

What's inside the syntactically wrong (* and ) is something to be replaced, and then I wrote a little script to do the string substitution, to get very long regexps to be compiled. I thought that might be a way to handle long and complex regexps, and in this course I encountered the problem with the semantics of '|'. My approach may sound desperate and funny, but please, do you have any good idea as to how to handle long and complex regexps?
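A minimal sketch of the kind of substitution script described above (the (*NAME) file format is the poster's; the parsing is simplified and assumes each name is defined before it is used):

    import re

    def build_patterns(lines):
        # Each line looks like:  (*NAME) = regex-fragment
        # Later definitions may reference earlier ones via (*NAME).
        defs = {}
        for line in lines:
            name, body = line.split('=', 1)
            name = name.strip()[2:-1]        # strip the '(*' and ')'
            # Substitute previously defined sub-patterns, wrapped in
            # (?:...) so alternation inside them stays self-contained.
            body = re.sub(r'\(\*(\w+)\)',
                          lambda m: '(?:%s)' % defs[m.group(1)],
                          body.strip())
            defs[name] = body
        return defs

    defs = build_patterns([
        r'(*INT) = \d+',
        r'(*DECIMAL) = (*INT)\.(*INT)',
        r'(*FRACTION) = (*DECIMAL)/(*DECIMAL)',
        r'(*NUMERALS) = (*FRACTION)|(*DECIMAL)|(*INT)',
    ])
    print defs['NUMERALS']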
Re: How to get the "longest possible" match with Python's RE module?
Bryan Olson wrote:
> Licheng Fang wrote:
> > Oh, please do have a look at the second link I've posted. There's a
> > table comparing the regexp engines. The engines you've tested probably
> > all use an NFA implementation.
>
> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
> book was the first time I saw this confusion about what NFA is;
> I don't know if he originated the mess or just propagated it.
>
> "Nondeterministic finite automata" is well defined in theory
> of computation. The set of languages accepted by NFA's is
> exactly the same as the set accepted by DFA's.
>
> What Python uses is search-and-backtrack. Unfortunately such
> engines don't have much theory behind them, and it's hard to
> reason generally about what they do.
>
> --
> --Bryan

Thanks for the valuable information. Indeed, when I read the pages, I was a little confused about what they meant by 'NFA'. But I faintly felt there might be an indirect, and not very exact, mapping from the search-and-backtrack strategy to NFAs in the sense of computer science, e.g. a state machine with the capability to be in several states at one time.

Say, when reading 'oneselfsufficient', the engine goes along the NFA

          one        self, none     selfsufficient, none
(start)--->((1))-------->((2))--------->((3))

first to state 1, and then faces the choice between
1) matching 'self',
2) going on to state 2 without matching anything, or
3) just giving 'one' as the matching result, because state 1 is already a terminal state.

In such situations it always chooses the greedy way. To match more, it goes to state 2, munching 'self'. Now it's left with only 'sufficient', and its choices are:
1) matching nothing and going to state 3, or
2) just giving 'oneself' as the result, because state 2 is also a terminal state.

Again it's greedy, going on to state 3 in hope of matching more. But here the pattern comes to an end, represented by state 3 as a terminal state. So the engine gives 'oneself' as the result and forgets about its previously unexplored possibilities, because it only backtracks when a match cannot be found. I think if the backtracking were carried out in an exhaustive way, we could say the engine tries every possibility on the NFA, though it's not an NFA itself.
Re: How to get the "longest possible" match with Python's RE module?
[EMAIL PROTECTED] wrote:
> kondal wrote:
>
> > This is the way the regexp works python doesn't has anything to do
> > with it. It starts parsing the data with the pattern given. It
> > returns the matched string acording the pattern and doesn't go back
> > to find the other combinations.
>
> I've recently had the same problem in Java, using automatically
> generated regular expressions to find the longest match; I failed on
> cases like matching the whole of "Abcdefg", but also the whole of
> "AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
> No systematic way to deal with these corner cases was available, and
> unsystematic ways (with greedy and reluctant quantifiers) were too
> complex.
> I ended up eliminating regular expressions completely and building a
> dynamic programming parser that returns the set of all match lengths;
> it wasn't hard and it should be even easier in Python.
>
> Lorenzo Gatti

Thanks. I think making use of the expressiveness of CFGs may be a better idea.

Another question: my task is to find, in a given string, the substrings that satisfy a particular pattern. That's why the first tool that came to my mind is regular expressions. Parsers, however, only give a yes/no answer to a given string. To find all substrings with a particular pattern I may have to try every substring, which may be an impossible task. How can I solve this problem?
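For the substring-search question, one brute-force sketch: anchor a match attempt at every start position and collect the span each attempt covers (quadratic in the worst case, but it finds matches a single left-to-right scan would skip):

    import re

    def find_all(pattern, text):
        # Attempt a match at every position; collect (start, end, text).
        p = re.compile(pattern)
        hits = []
        for i in range(len(text)):
            m = p.match(text, i)
            if m:
                hits.append((m.start(), m.end(), m.group()))
        return hits

    print find_all(r'\d+', 'abc123de45')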
Re: How to get the "longest possible" match with Python's RE module?
Thank you guys. I've written a CYK parser and realized this is the right direction. It gives every possible interpretation of the string and I can retrieve whatever I want.
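For reference, here is a minimal sketch of the kind of CYK recognizer mentioned (not the poster's code: it assumes a grammar already in Chomsky normal form, given as dicts from right-hand sides to sets of left-hand symbols):

    def cyk(tokens, unary, binary, start='S'):
        # unary:  terminal -> set of nonterminals A with rule A -> terminal
        # binary: (B, C)   -> set of nonterminals A with rule A -> B C
        n = len(tokens)
        # table[i][l] = nonterminals deriving tokens[i : i + l + 1]
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, tok in enumerate(tokens):
            table[i][0] = set(unary.get(tok, []))
        for span in range(2, n + 1):            # span length
            for i in range(n - span + 1):       # span start
                for k in range(1, span):        # split point
                    for B in table[i][k - 1]:
                        for C in table[i + k][span - k - 1]:
                            table[i][span - 1] |= binary.get((B, C), set())
        return start in table[0][n - 1]

    # Toy grammar: S -> A B, A -> 'a', B -> 'b'
    print cyk(['a', 'b'],
              {'a': set(['A']), 'b': set(['B'])},
              {('A', 'B'): set(['S'])})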
extremely slow array indexing?
Hi, I am writing code to sort the columns according to the sum of each column. The dataset is huge (50k rows x 300k cols), so I need to read it line by line and do the summation to avoid the out-of-memory problem. But I don't know why it runs very slowly, and part of the code is as follows. I suspect it's because of the array indexing, but I'm not sure. Can anyone point out what needs to be modified to make it run fast? Thanks in advance!

...
from numpy import *
...
currSum = zeros(self.componentcount)
currRow = zeros(self.componentcount)
for featureDict in self.featureDictList:
    currRow[:] = 0
    for components in self.componentdict1:
        if featureDict.has_key(components):
            col = self.componentdict1[components]
            value = featureDict[components]
            currRow[col] = value
    # the archived post says 'row' here; presumably it means currRow
    currSum = currSum + currRow
...
Re: extremely slow array indexing?
Hi Will, thanks for your reply. The simplified code is as follows, and you can run it if you like. It takes 7 seconds to process 1000 rows, which is tolerable, but I wonder why it takes so long, because I also did one for loop through all of the same rows without accessing the array, which only takes 1 second per 1000 rows. Aren't vectorized operations supposed to run very quickly?

from numpy import *

componentcount = 30   # as archived; the real run used ~300k components
currSum = zeros(componentcount)
row = zeros(componentcount)   # current row
rowcount = 5                  # as archived; the timings above are per 1000 rows
for i in range(1, rowcount):
    row[:] = 1
    currSum = currSum + row

> Array indexing is unlikely to be the culprit. Could it not just be
> slow, because you are processing a lot of data? With numbers those big
> I would expect to have enough time to go make a coffee, then drink it.
>
> If you think it is slower than it could be, post more code for
> optimization advice...
>
> Will McGugan
> --
> http://www.willmcgugan.com
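One likely cause, for what it's worth: the statement currSum = currSum + row allocates a brand-new array on every iteration. In-place accumulation avoids that allocation (a sketch reusing the made-up sizes above):

    from numpy import zeros, ones

    componentcount = 30
    rowcount = 1000
    currSum = zeros(componentcount)
    row = ones(componentcount)
    for i in range(rowcount):
        currSum += row      # in-place add; no new array is created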
How to configure proxy authentication when using urllib2?
I use an HTTP proxy to connect to the Internet. When I run the urlopen command I get "HTTP Error 407: Proxy authorization required". Could anybody tell me how to resolve this? Thanks!
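One standard urllib2 recipe for this, as a sketch (proxyhost, the port, user, and password are all placeholders; some proxies also accept the credentials embedded in the proxy URL itself, as http://user:password@proxyhost:8080):

    import urllib2

    proxy = urllib2.ProxyHandler({'http': 'http://proxyhost:8080'})
    auth = urllib2.ProxyBasicAuthHandler()
    auth.add_password(None, 'http://proxyhost:8080', 'user', 'password')
    opener = urllib2.build_opener(proxy, auth)
    urllib2.install_opener(opener)   # urllib2.urlopen now goes via the proxy
    print urllib2.urlopen('http://www.python.org/').read(200)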
Problem of Readability of Python
Python is supposed to be readable, but after programming in Python for a while I find my Python programs can be more obfuscated than their C/C++ counterparts sometimes. Part of the reason is that with heterogeneous lists/tuples at hand, I tend to stuff many things into the list and *assume* a structure of the list or tuple, instead of declaring it explicitly as one would do with C structs. So, what used to be

struct nameval {
    char *name;
    int val;
} a;

a.name = ...
a.val = ...

becomes cryptic

a[0] = ...
a[1] = ...

The Python Tutorial says an empty class can be used to do this. But if namespaces are implemented as dicts, wouldn't it incur much overhead if one defines empty classes as such for some very frequently used data structures of the program? Any elegant solutions?
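Two common struct substitutes, sketched (collections.namedtuple needs Python 2.6+; the __slots__ class works on earlier versions too):

    # Lightweight record class; __slots__ avoids a per-instance dict.
    class NameVal(object):
        __slots__ = ('name', 'val')
        def __init__(self, name, val):
            self.name = name
            self.val = val

    a = NameVal('spam', 42)
    print a.name, a.val

    # Or, on Python 2.6+, an immutable record with named fields:
    from collections import namedtuple
    NameVal2 = namedtuple('NameVal2', 'name val')
    b = NameVal2(name='spam', val=42)
    print b.name, b.val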
Re: Problem of Readability of Python
On Oct 8, 4:24 am, [EMAIL PROTECTED] (Alex Martelli) wrote:
> Licheng Fang <[EMAIL PROTECTED]> wrote:
> ...
> > Python Tutorial says an empty class can be used to do this. But if
> > namespaces are implemented as dicts, wouldn't it incur much overhead
> > if one defines empty classes as such for some very frequently used
> > data structures of the program?
>
> Just measure:
>
> $ python -mtimeit -s'class A(object):pass' -s'a=A()' 'a.zop=23'
> 100 loops, best of 3: 0.241 usec per loop
>
> $ python -mtimeit -s'a=[None]' 'a[0]=23'
> 1000 loops, best of 3: 0.156 usec per loop
>
> So, the difference, on my 18-months-old laptop, is about 85 nanoseconds
> per write-access; if you have a million such accesses in a typical run
> of your program, it will slow the program down by about 85
> milliseconds. Is that "much overhead"? If your program does nothing
> else except those accesses, maybe, but then why are your writing that
> program AT ALL?-)
>
> And yes, you CAN save about 1/3 of those 85 nanoseconds by having
> '__slots__=["zop"]' in your class A(object)... but that's the kind of
> thing one normally does only to tiny parts of one's program that have
> been identified by profiling as dramatic bottlenecks, to shave off the
> last few nanoseconds in the very last stages of micro-optimization of a
> program that's ALMOST, but not QUITE, fast enough... knowing about such
> "extreme last-ditch optimization tricks" is of very doubtful value (and
> I think I'm qualified to say that, since I _do_ know many of
> them...:-). There ARE important performance things to know about
> Python, but those worth a few nanoseconds don't matter much.
>
> Alex

This is enlightening. Surely I shouldn't have worried too much about performance before doing some measurement.
Accessing Function Variables from Sub-functions
On Apr 14 2003, 10:30 pm, Alex Martelli <[EMAIL PROTECTED]> wrote:
> Sebastian Wilhelmi wrote:
> > Hi,
> >
> > I would like to do the following:
> >
> > ---8<---8<---8<---8<---
> > def test ():
> >     count = 0
> >     def inc_count ():
> >         count += 1
> >     inc_count ()
> >     inc_count ()
> >     print count
> >
> > test ()
> > ---8<---8<---8<---8<---
> >
> > This doesn't work (and I even understand, why ;-)
>
> Specifically: a nested function cannot *RE-BIND* a variable of
> an outer function.

Sorry to dig up this old thread, but I would like to know what the rationale is. Why can't a nested function rebind a variable of an outer function?

> > One solution is the following, which I however do not see as very
> > clean or nice.
> >
> > ---8<---8<---8<---8<---
> > def test ():
> >     count = [0]
> >     def inc_count ():
> >         count[0] += 1
> >     inc_count ()
> >     inc_count ()
> >     print count[0]
> >
> > test ()
> > ---8<---8<---8<---8<---
> >
> > Now my question: Is there some way to achieve this with a nicer
> > syntax?
>
> Depends on your tastes in syntax, e.g.:
>
> def test():
>     class Bunch: pass
>     loc = Bunch()
>     loc.count = 0
>     def inc_count():
>         loc.count += 1
>     inc_count()
>     inc_count()
>     print loc.count
>
> or:
>
> def test():
>     test.count = 0
>     def inc_count():
>         test.count += 1
>     inc_count()
>     inc_count()
>     print test.count
>
> and no doubt quite a few others.

I was trying to write a function that creates another function and returns it when I came across this problem. These two solutions have a difference:

def M():
    M.c = 0
    class Bunch: pass
    Bunch.c = 0
    def f():
        M.c += 1
        Bunch.c += 1
        print M.c, Bunch.c
    return f

>>> f = M()
>>> f2 = M()
>>> f()
1 1
>>> f()
2 2
>>> f()
3 3
>>> f2()
4 1
>>> f2()
5 2
>>> f2()
6 3

The created functions share the variables bound to the outer function object, but have their own separate copies of the variables bundled in a class. Is binding names to the function object a Python way to use 'static' variables? But how does one initialize them?

> > Using 'global' would not count, as that would make a variable
> > unnecessarily known to the outside.
>
> The two solutions above outlined differ in this respect -- variable
> 'loc' in the former is local to function test, while function attribute
> test.count in the latter is "known to the outside" (it survives each
> execution of test and can be examined "from the outside").
>
> Class Bunch (or some highly refined version thereof) IS typically
> around whenever I program, see for example:
>
> http://mail.python.org/pipermail/python-list/2002-July/112007.html
>
> for a typical presentation. Having available the MetaBunch there
> described, I'd start the function as follows:
>
> def test():
>     class Bunch(MetaBunch): count=0
>     loc = MetaBunch()
>     def inc_count():
>         loc.count += 1
>
> etc. I do find such constructs often handy...
>
> Alex
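For the record, Python 3 later added the nonlocal statement for exactly this rebinding case (a sketch; it does not exist in the Python 2 versions discussed in this thread):

    def test():
        count = 0
        def inc_count():
            nonlocal count   # rebind the enclosing function's variable
            count += 1
        inc_count()
        inc_count()
        print(count)         # prints 2 (Python 3 syntax)

    test()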
How can I create customized classes that have similar properties as 'str'?
I mean, all the class instances that equal each other should be reduced to only one instance, which means that for instances of this class there is no difference between a is b and a == b. Thank you.
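A minimal sketch of such an interning ("flyweight") class: __new__ keeps one canonical instance per value, so == collapses to is (the class and attribute names are made up, and instances must be treated as immutable for this to be safe):

    class Atom(object):
        _cache = {}
        def __new__(cls, value):
            # Return the one canonical instance for this value.
            try:
                return cls._cache[value]
            except KeyError:
                obj = object.__new__(cls)
                obj.value = value
                cls._cache[value] = obj
                return obj

    a = Atom(('fool', 'chicken'))
    b = Atom(('fool', 'chicken'))
    print a is b   # True: equal values share one instance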
Re: How can I create customized classes that have similar properties as 'str'?
I find myself frequently in need of classes like this, for two reasons. First, it's efficient in memory. Second, when two instances are compared for equality only their pointers are compared. (I think that's how Python compares 'str's.)

On Nov 24, 6:31 pm, Licheng Fang <[EMAIL PROTECTED]> wrote:
> I mean, all the class instances that equal each other should be
> reduced to only one instance, which means that for instances of this
> class there is no difference between a is b and a == b.
>
> Thank you.
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 7:05 pm, Bjoern Schliessmann wrote:
> Licheng Fang wrote:
> > I find myself frequently in need of classes like this for two
> > reasons. First, it's efficient in memory.
>
> Are you using millions of objects, or MB size objects? Otherwise,
> this is no argument.

Yes, millions. In my natural language processing tasks, I almost always need to define patterns, identify their occurrences in huge data, and count them. Say, I have a big text file, consisting of millions of words, and I want to count the frequency of trigrams:

trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

I can save the counts in a dict D1. Later, I may want to recount the trigrams, with some minor modifications, say, doing it on every other line of the input file, and save the counts in dict D2. The problem is, D1 and D2 have almost the same set of keys (trigrams of the text), yet the keys in D2 are new instances, even though these keys have probably already been inserted into D1. So I end up with unnecessary duplicates of keys. And this can be a great waste of memory with huge input data.

> BTW, what happens if you, by some operation, make a == b, and
> afterwards change b so another object instance must be created?
> This instance management is quite a runtime overhead.

I probably need this class to be immutable.

> > Second, when two instances are compared for equality only their
> > pointers are compared.
>
> I state that the object management will often eat more performance
> than equality testing. Except you have a huge number of equal
> objects. If the latter was the case you should rethink your program
> design.

Yeah, counting is all about equal or not.

> > (I think that's how Python compares 'str's.)
>
> Generally not. In CPython, just very short strings are created only
> once.
>
> >>> a = " "
> >>> b = " "
> >>> a is b
> True
> >>> a = "  "
> >>> b = "  "
> >>> a is b
> False

Wow, I didn't know this. But exactly how does Python manage these strings? My interpreter gave me these results:

>>> a = 'this'
>>> b = 'this'
>>> a is b
True
>>> a = 'this is confusing'
>>> b = 'this is confusing'
>>> a is b
False

> Regards,
>
> Björn
>
> --
> BOFH excuse #430:
>
> Mouse has out-of-cheese-error
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 9:42 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> On Sat, 24 Nov 2007 13:40:40 +0100, Bjoern Schliessmann wrote:
> > Licheng Fang wrote:
> >> Wow, I didn't know this. But exactly how does Python manage these
> >> strings?
> >
> > I don't know (use the source, Luke). :) Or perhaps there is a Python
> > Elder here that knows?
>
> AFAIK strings of length 1 and strings that would be valid Python
> identifiers are treated this way.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Thanks. Then, is there a way to make Python treat all strings this way, or any other kind of immutable objects?
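For strings specifically, CPython exposes this via the built-in intern() (moved to sys.intern() in Python 3); for other immutable objects, a cache keyed by value, like the __new__ sketch earlier in this thread, does the same job:

    a = intern('this is confusing')
    b = intern('this is confusing')
    print a is b   # True: both names refer to the single interned copy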
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 25, 5:59 am, Steven D'Aprano <[EMAIL PROTECTED] cybersource.com.au> wrote:
> On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
> > On Nov 24, 7:05 pm, Bjoern Schliessmann <[EMAIL PROTECTED]> wrote:
> >> Licheng Fang wrote:
> >> > I find myself frequently in need of classes like this for two
> >> > reasons. First, it's efficient in memory.
> >>
> >> Are you using millions of objects, or MB size objects? Otherwise,
> >> this is no argument.
> >
> > Yes, millions.
>
> Oh noes!!! Not millions of words! That's like, oh, a few tens of
> megabytes! How will a PC with one or two gigabytes of RAM cope?
>
> Tens of megabytes is not a lot of data.
>
> If the average word size is ten characters, then one million words
> takes ten million bytes, or a little shy of ten megabytes. Even if you
> are using four-byte characters, you've got 40 MB, still a moderate
> amount of data on a modern system.

I mentioned trigram counting as an illustrative case. In fact, you'll often need to define patterns more complex than that, and tens of megabytes of text may generate millions of them; I've watched them quickly eat up the 8G memory of a workstation in a few minutes. Manipulating these patterns can be tricky, and you can easily waste a lot of memory without taking extra care. I just thought that if I defined my pattern class with this 'atom' property, coding efforts could be easier later.

> > In my natural language processing tasks, I almost always
> > need to define patterns, identify their occurrences in huge data, and
> > count them. [snip]
>
> All these keys will almost certainly add up to only a few hundred
> megabytes, which is a reasonable size of data but not excessive. This
> really sounds to me like a case of premature optimization. I think you
> are wasting your time solving a non-problem.
>
> [snip]
>
> > Wow, I didn't know this. But exactly how does Python manage these
> > strings? My interpreter gave me these results:
> >
> > >>> a = 'this'
> > >>> b = 'this'
> > >>> a is b
> > True
> > >>> a = 'this is confusing'
> > >>> b = 'this is confusing'
> > >>> a is b
> > False
>
> It's an implementation detail. You shouldn't use identity testing
> unless you actually care that two names refer to the same object, not
> because you want to save a few bytes. That's poor design: it's fragile,
> complicated, and defeats the purpose of using a high-level language
> like Python.
>
> --
> Steven.
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 27, 10:45 am, Steven D'Aprano <[EMAIL PROTECTED]> wrote:
> On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:
> > I mentioned trigram counting as an illustrative case. In fact, you'll
> > often need to define patterns more complex than that, and tens of
> > megabytes of text may generate millions of them; I've watched them
> > quickly eat up the 8G memory of a workstation in a few minutes.
> > Manipulating these patterns can be tricky, and you can easily waste a
> > lot of memory without taking extra care. I just thought that if I
> > defined my pattern class with this 'atom' property, coding efforts
> > could be easier later.
>
> I'm just not getting the same results as you when I try this. I'm
> finding that with no extra effort at all, it just works.
>
> The size of your corpus is not important. Neither is the complexity of
> how you generate the patterns. What's important is the number of
> patterns you produce, and "millions" isn't that huge a number, even
> without interning the strings.
>
> Obviously I'm not running your code, but I can build a dict with
> millions of patterns, from hundreds of megabytes of text, on a PC with
> just 1GB of memory and not run into any serious problems.
>
> I've just processed roughly 240MB of random emails, generating n-grams
> up to length 5. The emails include binary attachments and HTML etc., so
> I'm getting lots of patterns that don't normally exist in natural
> languages (e.g. 71 occurrences of 'qqq', and 5 of ''). As I said, my PC
> has only 1GB, and that's being shared with about a dozen other apps
> (including such memory hogs as Firefox).
>
> Results? 64939962 patterns found, of which 17031467 are unique. There's
> paging, yes, and my PC runs a little slow when I try to switch from one
> running application to another, but nothing unusable. Opening a dozen
> YouTube videos at once impacts performance worse.
>
> I can't think what you're doing to use up 8GB of RAM for merely
> "millions" of strings, even if you are keeping two, three, ten
> redundant copies. Assuming an average length of twenty bytes per
> pattern (you said trigrams, but I'd rather over-estimate than under),
> and even assuming that only half the 8GB are available to Python, you
> should be able to store something of the order of one hundred million
> patterns easily:

My task is identifying sequences of tokens (phrases) that are possible translations of each other in a bilingual corpus. I need to check all the subsequences of a sentence to get the possible phrase pairs. This makes the problem different from n-gram counting in that the number of possible phrases doesn't grow linearly with n, but approximately with n^2 (n being the sentence length).

My first version of the program consumed almost twice as much memory as the current one, because I discovered that in computing different statistics I was regenerating the patterns, and the counting dictionaries ended up with duplicated pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I could restrict the pattern class to not generate identical instances? Then I could avoid such subtle but significant bugs.

> 4 bytes for a pointer plus 20 bytes for the string = 24 bytes
>
> 4*1024**3 / 24 = 178,956,970
>
> (This is a ballpark figure. Real Python strings will have additional
> overhead.)
>
> If you're running into problems with merely millions of patterns, then
> you're doing worse by probably two orders of magnitude.
>
> I don't think that the problem lies where you think it does. If you
> have a dict with millions of keys, it doesn't matter how many times
> each pattern exists in the corpus because the key only exists *once* in
> the dict. Duplicate the dict, or generate it from scratch even, and at
> worse you double the memory used by the keys. And probably not even
> that.
>
> The only thing I can think of that might explain why you're using so
> much memory is if you are generating *all* the patterns up front, say
> in a list, before adding them to the dict:
>
> # Generate one massive list of patterns containing many duplicates
> patterns = make_patterns(text)
> # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
> d = {}
> for pattern in patterns:
>     d[pattern] = d.get(pattern, 0) + 1

No, I wasn't doing that. BTW, do you think the pattern counting code can avoid hashing the pattern twice? Is there a way to do that when the dictionary values are of a primitive type?

> Notice that the real killer in the above algorithm is that you need
> enough storage, not just for the
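On the double-hashing question raised above: d[pattern] = d.get(pattern, 0) + 1 looks the key up twice at the Python level. collections.defaultdict (Python 2.5+) doesn't remove the second lookup, but it moves the default handling into C and tidies the code; note also that CPython str objects cache their hash value, so repeatedly hashing the same string is cheap. A sketch:

    from collections import defaultdict

    counts = defaultdict(int)           # missing keys start at 0
    for pattern in ['fre', 'req', 'fre']:
        counts[pattern] += 1            # still a lookup plus a store,
                                        # but no Python-level .get() call
    print dict(counts)                  # counts: 'fre' -> 2, 'req' -> 1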
Nested Lists Assignment Problem
I want to use nested lists as an array, but here's the problem:

>>> a = [[0]*3]*3
>>> a
[[0, 0, 0], [0, 0, 0], [0, 0, 0]]
>>> a[0][0] = 1
>>> a
[[1, 0, 0], [1, 0, 0], [1, 0, 0]]

Could anybody please explain to me why three values were changed? I'm bewildered. Thanks!
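The usual fix, as a sketch: build each inner row separately, so the rows are distinct lists rather than three references to one list:

    a = [[0] * 3 for i in range(3)]   # three independent rows
    a[0][0] = 1
    print a   # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]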
Re: Nested Lists Assignment Problem
Dennis Lee Bieber wrote:
> On 26 Apr 2006 01:13:20 -0700, "Licheng Fang" <[EMAIL PROTECTED]>
> declaimed the following in comp.lang.python:
>
> > Could anybody please explain to me why three values were changed? I'm
> > bewildered. Thanks!
>
> http://www.aifb.uni-karlsruhe.de/Lehrangebot/Winter2000-01/E-Shop/Python/Doku/The%20Whole%20Python%20FAQ.htm#4.50

Thank you very much! But I still wonder why a nested assignment "a = [[0]*3]*3" generates 3 references to the same list, while the commands below apparently do not.

>>> a = [0] * 2
>>> a
[0, 0]
>>> a[0] = 1
>>> a
[1, 0]
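In short, both forms copy references, but the two assignments do different things: a[0] = 1 rebinds a slot of the outer list, while a[0][0] = 1 mutates the one shared inner list. A small demonstration:

    a = [[0] * 3] * 3
    print a[0] is a[1]    # True: three references to one inner list
    a[0][0] = 1           # mutates that shared list, so every row shows it

    b = [0] * 2           # two references to the same int object 0
    b[0] = 1              # rebinds slot 0 to a new object; b[1] unaffected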