Re: how to debug python's extend module written in c/c++ on windows
Dear Diez:

It is a matter of attaching a C debugger to Python. I can attach a Python debugger (for example, Wing IDE) to a C debugger (for example, VS2008), but I cannot attach VS2008 to Wing IDE. I need both Python statements and C statements to be single-step debuggable.

best regards
fang
Re: how to debug python's extend module written in c/c++ on windows
Dear Diez:

I see. I really appreciate your help.

best regards
fang
how to cut and paste in the windows of python(command line)
Dear all:

The mouse gets no response in the window of Python (command line), so cut and paste cannot be done there. Ctrl-C and Ctrl-V do not work, but they do work in IDLE. Please teach me how to cut and paste in the Python command line.
Re: how to cut and paste in the windows of python(command line)
Dear: Thank you very much! I can do it! It's very nice!
Re: Python GTK import error
sorry to keep bugging you guys, but this is a weird problem that I am eager to solve. I checked the permissions and they look fine to me:

lrwxrwxrwx  1 root root      26 Nov 28 04:06 /usr/lib/libgtk-x11-2.0.so.0 -> libgtk-x11-2.0.so.0.400.13
-rwxr-xr-x  1 root root 2862900 Oct 15 14:23 /usr/lib/libgtk-x11-2.0.so.0.400.13

Even when I run as root, the error message "ImportError: /usr/lib/libgtk-x11-2.0.so.0: undefined symbol: atk_object_add_relationship" still comes out. I always try to avoid installing packages manually; rpm or apt-get is my preferred way. However, this time it defeated my working principle, because this error message appeared after I upgraded packages using "synaptic", an apt-get GUI. Now I don't even trust apt-get :(

Andrew James <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> Try checking the permissions on the link and the library - if you
> installed the module manually the permissions may not allow anyone but
> the root user to import the module.
>
> Andrew
How to get the type of an object?
I wrote a function with a list as its parameter, and the function has to perform different operations based on the datatypes of the elements. How can I decide whether an object is, say, a list or a string? Thanks.
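A minimal sketch of this kind of type dispatch, using the built-in isinstance() (the function name and sample data are made up):

    def process(items):
        # isinstance also accepts a tuple of types and respects subclasses,
        # which is why it is usually preferred over type(x) == list.
        for item in items:
            if isinstance(item, str):
                print 'string:', item
            elif isinstance(item, list):
                print 'list of length', len(item)
            else:
                print 'something else:', item

    process(['abc', [1, 2, 3], 42])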
Re: How to get the type of an object?
That helps. Thank you guys.
Python and STL efficiency
Hi, I'm learning STL and I wrote some simple code to compare the efficiency of Python and STL.

//C++
#include <iostream>
#include <string>
#include <vector>
#include <set>
#include <iterator>
#include <algorithm>
using namespace std;

int main(){
    vector<string> a;
    // NB: the loop bound was garbled in the list archive; a count in
    // the tens of thousands is consistent with the discussion below.
    for (long int i = 0; i < 10000; ++i){
        a.push_back("What do you know?");
        a.push_back("so long...");
        a.push_back("chicken crosses road");
        a.push_back("fool");
    }
    set<string> b(a.begin(), a.end());
    unique_copy(b.begin(), b.end(), ostream_iterator<string>(cout, "\n"));
}

#python
def f():
    a = []
    for i in range(10000):   # same (reconstructed) loop bound as the C++ version
        a.append('What do you know')
        a.append('so long...')
        a.append('chicken crosses road')
        a.append('fool')
    b = set(a)
    for s in b:
        print s

I was using VC++.net and IDLE, respectively. I had expected C++ to be way faster. However, while the Python code gave the result almost instantly, the C++ code took several seconds to run! Can somebody explain this to me? Or is there something wrong with my code?
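For timing the Python side of such comparisons, the standard library's timeit module avoids common measurement mistakes (a sketch; it assumes the f defined above, and the prints will dominate the measured time):

    import timeit

    # Time a single call to f(); 'setup' imports it from this module.
    t = timeit.Timer('f()', 'from __main__ import f')
    print t.timeit(number=1), 'seconds'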
Re: Python and STL efficiency
Marc 'BlackJack' Rintsch wrote:
> In <[EMAIL PROTECTED]>, Licheng Fang wrote:
>
> > Hi, I'm learning STL and I wrote some simple code to compare the
> > efficiency of python and STL.
> >
> > [snip: the C++ and Python versions quoted from the original post]
>
> Why are you using `unique_copy` here?

Sorry, that's a typo. Actually I used 'copy'.

> > I was using VC++.net and IDLE, respectively. I had expected C++ to be
> > way faster. However, while the python code gave the result almost
> > instantly, the C++ code took several seconds to run! Can somebody
> > explain this to me? Or is there something wrong with my code?
>
> There's a difference in data structures at least. The Python `set` type
> is implemented with a hash algorithm, so the equivalent STL type would
> be `hash_set`. `set` in Python does not store its contents sorted.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Thank you for your comments. I tested with hash_set, but I didn't see much performance improvement. When I increased the loop to 1 million times, the python code still ran reasonably fast and the C++ code got stuck there. This totally surprised me, because according to this page http://norvig.com/python-lisp.html, the speed of python is nowhere near that of C++.
Re: Modules... paths... newbie confusion
MrBlueSky wrote:
> I wonder if someone could clarify how Python "knows" where modules are
> - or at least point to some documentation that might help me? Here's
> what I've been trying:
>
> I've installed Python 2.4 Windows, and have also installed tkinter,
> pmw, cx_Oracle, mssql and pytz (phew!) all under my c:\python24 folder.
>
> But when I try to "import pytz" or "import MSSQL" in a Python shell
> (via IDLE) it's not recognised - yet "import Tkinter", "import Pmw" and
> "import cx_Oracle" all work.
>
> I've experimented with "sys.path" to get the import of pytz to work,
> but without success so far.
>
> I feel as if I'm missing some key piece of information on how this all
> fits together! Please, help!
>
> John

You may have to add the path of the module to the "PYTHONPATH" environment variable to make it work.
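To inspect and extend the module search path for a single session (the site-packages directory shown is illustrative; the right entry is whichever folder actually contains the pytz package):

    import sys
    print sys.path    # the directories searched for imports, in order
    # Make a package importable for this session only:
    sys.path.append(r'C:\python24\Lib\site-packages')
    import pytz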
How to get the "longest possible" match with Python's RE module?
Basically, the problem is this:

>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'

Python's NFA regexp engine tries only the first option, and happily rests on that. Here's another example:

>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'

The Python regular expression engine doesn't exhaust all the possibilities, but in my application I hope to get the longest possible match, starting from a given point. Is there a way to do this in Python?
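One workaround, given that Python's alternation is first-match rather than longest-match: try each alternative separately and keep the longest hit (a sketch; the helper name is made up, and it only emulates longest-match for a plain alternation):

    import re

    def longest_match(alternatives, text, pos=0):
        # Emulate POSIX leftmost-longest semantics for an alternation
        # by testing each branch on its own and keeping the longest.
        best = None
        for alt in alternatives:
            m = re.compile(alt).match(text, pos)
            if m and (best is None or len(m.group()) > len(best.group())):
                best = m
        return best

    print longest_match(["do", "dolittle"], "dolittle").group()   # dolittle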
Re: How to get the "longest possible" match with Python's RE module?
MonkeeSage wrote:
> Licheng Fang wrote:
> > Basically, the problem is this:
> >
> > >>> p = re.compile("do|dolittle")
> > >>> p.match("dolittle").group()
> > 'do'
>
> From what I understand, this isn't python specific, it is the expected
> behavior of that pattern in any implementation. You are using
> alternation, which means "either, or", and you have the shorter
> subexpression first, so the condition is satisfied by just 'do' and the
> matching terminates.
>
> > There's another example:
> >
> > >>> p = re.compile("one(self)?(selfsufficient)?")
> > >>> p.match("oneselfsufficient").group()
> > 'oneself'
>
> Again, I don't think this has anything to do with python. You pattern
> basically means "match 'one' whether it is followed by 'self' or not,
> and whether it is followed by 'selfsufficient' or not". For this
> particular example, you'd want something like
> "one(self)?(sufficient)?".
>
> I think you could construct a pattern that would do what you want in
> python without any problem. If you post a (short) example of your data,
> I'm sure someone could help you with it.
>
> Regards,
> Jordan

Hi, according to these regexp engine discussions, it's NOT a behavior true to every implementation:

http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html

Python's NFA engine reads along the input string, matching it to the pattern, and backtracking when needed. By contrast a DFA engine, to my understanding, constructs a DFA and uses it to munch as many characters as possible. Maybe it's like this:

Pattern: one(self)?(selfsufficient)?

PYTHON'S NFA ENGINE:

          one        self, none     selfsufficient, none
(start)--->((1))-------->((2))--------->((3))

DFA ENGINE:

          one           self
(start)--->((123))------>((23))
                           |
                           | selfsufficient
                           |
                           +--->((3))

I want to know if there is some way to make Python RE behave like grep does, or do I have to change to another engine?
Re: How to get the "longest possible" match with Python's RE module?
Oh, please do have a look at the second link I've posted. There's a table comparing the regexp engines. The engines you've tested probably all use an NFA implementation.

MonkeeSage wrote:
> Licheng Fang wrote:
> > Hi, according to these regexp engine discussions, it's NOT a behavior
> > true to any implementation.
> > [snip]
>
> Well, I just double-checked in ruby (oniguruma regexp engine):
>
> r = Regexp.new("do|dolittle")
> puts r.match("dolittle")[0]
> # do
>
> r = Regexp.new("one(self)?(sufficient)?")
> puts r.match("oneselfsufficient")[0]
> # oneself
>
> And perl:
>
> if ("doolittle" =~ /(do|dolittle)/) {
>     print "$1\n";
>     # do
> }
>
> if ("oneselfsufficient" =~ /(one(self)?(selfsufficient)?)/) {
>     print "$1\n";
>     # oneself
> }
>
> And Javascript (whatever regexp engine Spidermonkey uses):
>
> var r = new RegExp(/do|dolittle/);
> alert("dolittle".match(r)[0]);
>
> var r = new RegExp(/one(self)?(selfsufficient)?/);
> alert("oneselfsufficient".match(r)[0]);
>
> So, it seems they are all broken, or python is correct as well.
>
> Regards,
> Jordan
Re: How to get the "longest possible" match with Python's RE module?
Thank you very much, Tim and Monkee.

In fact, what I'm doing is handling a lot of regular expressions. I wanted to build VERY LONG regexps part by part and put them all into a file for easy modification and maintenance. The idea is like this:

(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FRACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FRACTION)|(*DECIMAL)|(*INT)
... ...

What's inside the syntactically wrong (* and ) is something to be replaced, and then I wrote a little script to do the string substitution, to get very long regexps to be compiled. I thought that might be a way to handle long and complex regexps, and in this course I encountered the problem with the semantics of '|'. My approach may sound desperate and funny, but please, do you have any good idea as to how to handle long and complex regexps?
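A minimal sketch of the kind of substitution script described above (the (*NAME) file format is the poster's; the parsing is simplified and assumes each name is defined before it is used):

    import re

    def build_patterns(lines):
        # Each line looks like:  (*NAME) = regex-fragment
        # Later definitions may reference earlier ones via (*NAME).
        defs = {}
        for line in lines:
            name, body = line.split('=', 1)
            name = name.strip()[2:-1]        # strip the '(*' and ')'
            # Substitute previously defined sub-patterns, wrapped in
            # (?:...) so alternation inside them stays self-contained.
            body = re.sub(r'\(\*(\w+)\)',
                          lambda m: '(?:%s)' % defs[m.group(1)],
                          body.strip())
            defs[name] = body
        return defs

    defs = build_patterns([
        r'(*INT) = \d+',
        r'(*DECIMAL) = (*INT)\.(*INT)',
        r'(*FRACTION) = (*DECIMAL)/(*DECIMAL)',
        r'(*NUMERALS) = (*FRACTION)|(*DECIMAL)|(*INT)',
    ])
    print defs['NUMERALS']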
Re: How to get the "longest possible" match with Python's RE module?
Bryan Olson wrote:
> Licheng Fang wrote:
> > Oh, please do have a look at the second link I've posted. There's a
> > table comparing the regexp engines. The engines you've tested probably
> > all use an NFA implementation.
>
> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
> book was the first time I saw this confusion about what NFA is;
> I don't know if he originated the mess or just propagated it.
>
> "Nondeterministic finite automata" is well defined in theory
> of computation. The set of languages accepted by NFA's is
> exactly the same as the set accepted by DFA's.
>
> What Python uses is search-and-backtrack. Unfortunately such
> engines don't have much theory behind them, and it's hard to
> reason generally about what they do.
>
> --
> --Bryan

Thanks for the valuable information. Indeed, when I read the pages, I was a little confused about what they meant by 'NFA'. But I faintly felt there might be an indirect, and not very exact, mapping from the search-and-backtrack strategy to NFAs in the sense of computer science, e.g. a state machine with the capability to be in several states at one time.

Say, when reading 'oneselfsufficient', the engine goes along the NFA

          one        self, none     selfsufficient, none
(start)--->((1))-------->((2))--------->((3))

first to state 1, and then faces the choice between
1) matching 'self',
2) going on to state 2 without matching anything, or
3) just giving 'one' as the matching result, because state 1 is already a terminal state.

In such situations it always chooses the greedy way. To match more, it goes to state 2, munching 'self'. Now it's left with only 'sufficient', and its choices are:
1) matching nothing and going to state 3, or
2) just giving 'oneself' as the result, because state 2 is also a terminal state.

Again it's greedy, going on to state 3 in hope of matching more. But here the pattern comes to an end, represented by state 3 as a terminal state. So the engine gives 'oneself' as the result and forgets about its previously unexplored possibilities, because it only backtracks when a match cannot be found. I think if the backtracking were carried out in an exhaustive way, we could say the engine tries every possibility on the NFA, though it's not an NFA itself.
Re: How to get the "longest possible" match with Python's RE module?
[EMAIL PROTECTED] wrote:
> kondal wrote:
>
> > This is the way the regexp works python doesn't has anything to do
> > with it. It starts parsing the data with the pattern given. It
> > returns the matched string acording the pattern and doesn't go back
> > to find the other combinations.
>
> I've recently had the same problem in Java, using automatically
> generated regular expressions to find the longest match; I failed on
> cases like matching the whole of "Abcdefg", but also the whole of
> "AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
> No systematic way to deal with these corner cases was available, and
> unsystematic ways (with greedy and reluctant quantifiers) were too
> complex.
> I ended up eliminating regular expressions completely and building a
> dynamic programming parser that returns the set of all match lengths;
> it wasn't hard and it should be even easier in Python.
>
> Lorenzo Gatti

Thanks. I think making use of the expressiveness of CFGs may be a better idea.

Another question: my task is to find, in a given string, the substrings that satisfy a particular pattern. That's why the first tool that came to my mind is regular expressions. Parsers, however, only give a yes/no answer to a given string. To find all substrings with a particular pattern I may have to try every substring, which may be an impossible task. How can I solve this problem?
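For the substring-search question, one brute-force sketch: anchor a match attempt at every start position and collect the span each attempt covers (quadratic in the worst case, but it finds matches a single left-to-right scan would skip):

    import re

    def find_all(pattern, text):
        # Attempt a match at every position; collect (start, end, text).
        p = re.compile(pattern)
        hits = []
        for i in range(len(text)):
            m = p.match(text, i)
            if m:
                hits.append((m.start(), m.end(), m.group()))
        return hits

    print find_all(r'\d+', 'abc123de45')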
Re: How to get the "longest possible" match with Python's RE module?
Thank you guys. I've written a CYK parser and realized this is the right direction. It gives every possible interpretation of the string and I can retrieve whatever I want.
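For reference, here is a minimal sketch of the kind of CYK recognizer mentioned (not the poster's code: it assumes a grammar already in Chomsky normal form, given as dicts from right-hand sides to sets of left-hand symbols):

    def cyk(tokens, unary, binary, start='S'):
        # unary:  terminal -> set of nonterminals A with rule A -> terminal
        # binary: (B, C)   -> set of nonterminals A with rule A -> B C
        n = len(tokens)
        # table[i][l] = nonterminals deriving tokens[i : i + l + 1]
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, tok in enumerate(tokens):
            table[i][0] = set(unary.get(tok, []))
        for span in range(2, n + 1):            # span length
            for i in range(n - span + 1):       # span start
                for k in range(1, span):        # split point
                    for B in table[i][k - 1]:
                        for C in table[i + k][span - k - 1]:
                            table[i][span - 1] |= binary.get((B, C), set())
        return start in table[0][n - 1]

    # Toy grammar: S -> A B, A -> 'a', B -> 'b'
    print cyk(['a', 'b'],
              {'a': set(['A']), 'b': set(['B'])},
              {('A', 'B'): set(['S'])})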
extremely slow array indexing?
Hi, I am writing code to sort the columns according to the sum of each column. The dataset is huge (50k rows x 300k cols), so I need to read it line by line and do the summation to avoid the out-of-memory problem. But I don't know why it runs very slowly, and part of the code is as follows. I suspect it's because of the array indexing, but I'm not sure. Can anyone point out what needs to be modified to make it run fast? Thanks in advance!

...
from numpy import *
...
currSum = zeros(self.componentcount)
currRow = zeros(self.componentcount)
for featureDict in self.featureDictList:
    currRow[:] = 0
    for components in self.componentdict1:
        if featureDict.has_key(components):
            col = self.componentdict1[components]
            value = featureDict[components]
            currRow[col] = value
    # the archived post says 'row' here; presumably it means currRow
    currSum = currSum + currRow
...
Re: extremely slow array indexing?
Hi Will, thanks for your reply. The simplified code is as follows, and you can run it if you like. It takes 7 seconds to process 1000 rows, which is tolerable, but I wonder why it takes so long, because I also did one for loop through all of the same rows without accessing the array, which only takes 1 second per 1000 rows. Aren't vectorized operations supposed to run very quickly?

from numpy import *

componentcount = 30   # as archived; the real run used ~300k components
currSum = zeros(componentcount)
row = zeros(componentcount)   # current row
rowcount = 5                  # as archived; the timings above are per 1000 rows
for i in range(1, rowcount):
    row[:] = 1
    currSum = currSum + row

> Array indexing is unlikely to be the culprit. Could it not just be
> slow, because you are processing a lot of data? With numbers those big
> I would expect to have enough time to go make a coffee, then drink it.
>
> If you think it is slower than it could be, post more code for
> optimization advice...
>
> Will McGugan
> --
> http://www.willmcgugan.com
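One likely cause, for what it's worth: the statement currSum = currSum + row allocates a brand-new array on every iteration. In-place accumulation avoids that allocation (a sketch reusing the made-up sizes above):

    from numpy import zeros, ones

    componentcount = 30
    rowcount = 1000
    currSum = zeros(componentcount)
    row = ones(componentcount)
    for i in range(rowcount):
        currSum += row      # in-place add; no new array is created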
How to configure proxy authentication when using urllib2?
I use an HTTP proxy to connect to the Internet. When I run the urlopen command I get "HTTP Error 407: Proxy authorization required". Could anybody tell me how to resolve this? Thanks!
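One standard urllib2 recipe for this, as a sketch (proxyhost, the port, user, and password are all placeholders; some proxies also accept the credentials embedded in the proxy URL itself, as http://user:password@proxyhost:8080):

    import urllib2

    proxy = urllib2.ProxyHandler({'http': 'http://proxyhost:8080'})
    auth = urllib2.ProxyBasicAuthHandler()
    auth.add_password(None, 'http://proxyhost:8080', 'user', 'password')
    opener = urllib2.build_opener(proxy, auth)
    urllib2.install_opener(opener)   # urllib2.urlopen now goes via the proxy
    print urllib2.urlopen('http://www.python.org/').read(200)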
Problem of Readability of Python
Python is supposed to be readable, but after programming in Python for a while I find my Python programs can be more obfuscated than their C/C++ counterparts sometimes. Part of the reason is that with heterogeneous lists/tuples at hand, I tend to stuff many things into the list and *assume* a structure of the list or tuple, instead of declaring it explicitly as one would do with C structs. So, what used to be

struct nameval {
    char *name;
    int val;
} a;

a.name = ...
a.val = ...

becomes cryptic

a[0] = ...
a[1] = ...

The Python Tutorial says an empty class can be used to do this. But if namespaces are implemented as dicts, wouldn't it incur much overhead if one defines empty classes as such for some very frequently used data structures of the program? Any elegant solutions?
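Two common struct substitutes, sketched (collections.namedtuple needs Python 2.6+; the __slots__ class works on earlier versions too):

    # Lightweight record class; __slots__ avoids a per-instance dict.
    class NameVal(object):
        __slots__ = ('name', 'val')
        def __init__(self, name, val):
            self.name = name
            self.val = val

    a = NameVal('spam', 42)
    print a.name, a.val

    # Or, on Python 2.6+, an immutable record with named fields:
    from collections import namedtuple
    NameVal2 = namedtuple('NameVal2', 'name val')
    b = NameVal2(name='spam', val=42)
    print b.name, b.val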
Re: Problem of Readability of Python
On Oct 8, 4:24 am, [EMAIL PROTECTED] (Alex Martelli) wrote:
> Licheng Fang <[EMAIL PROTECTED]> wrote:
> ...
> > Python Tutorial says an empty class can be used to do this. But if
> > namespaces are implemented as dicts, wouldn't it incur much overhead
> > if one defines empty classes as such for some very frequently used
> > data structures of the program?
>
> Just measure:
>
> $ python -mtimeit -s'class A(object):pass' -s'a=A()' 'a.zop=23'
> 100 loops, best of 3: 0.241 usec per loop
>
> $ python -mtimeit -s'a=[None]' 'a[0]=23'
> 1000 loops, best of 3: 0.156 usec per loop
>
> So, the difference, on my 18-months-old laptop, is about 85 nanoseconds
> per write-access; if you have a million such accesses in a typical run
> of your program, it will slow the program down by about 85
> milliseconds. Is that "much overhead"? If your program does nothing
> else except those accesses, maybe, but then why are your writing that
> program AT ALL?-)
>
> And yes, you CAN save about 1/3 of those 85 nanoseconds by having
> '__slots__=["zop"]' in your class A(object)... but that's the kind of
> thing one normally does only to tiny parts of one's program that have
> been identified by profiling as dramatic bottlenecks, to shave off the
> last few nanoseconds in the very last stages of micro-optimization of a
> program that's ALMOST, but not QUITE, fast enough... knowing about such
> "extreme last-ditch optimization tricks" is of very doubtful value (and
> I think I'm qualified to say that, since I _do_ know many of
> them...:-). There ARE important performance things to know about
> Python, but those worth a few nanoseconds don't matter much.
>
> Alex

This is enlightening. Surely I shouldn't have worried too much about performance before doing some measurement.
Accessing Function Variables from Sub-functions
On Apr 14 2003, 10:30 pm, Alex Martelli <[EMAIL PROTECTED]> wrote:
> Sebastian Wilhelmi wrote:
> > Hi,
> >
> > I would like to do the following:
> >
> > ---8<---8<---8<---8<---
> > def test ():
> >     count = 0
> >     def inc_count ():
> >         count += 1
> >     inc_count ()
> >     inc_count ()
> >     print count
> >
> > test ()
> > ---8<---8<---8<---8<---
> >
> > This doesn't work (and I even understand, why ;-)
>
> Specifically: a nested function cannot *RE-BIND* a variable of
> an outer function.

Sorry to dig up this old thread, but I would like to know what the rationale is. Why can't a nested function rebind a variable of an outer function?

> > One solution is the following, which I however do not see as very
> > clean or nice.
> >
> > ---8<---8<---8<---8<---
> > def test ():
> >     count = [0]
> >     def inc_count ():
> >         count[0] += 1
> >     inc_count ()
> >     inc_count ()
> >     print count[0]
> >
> > test ()
> > ---8<---8<---8<---8<---
> >
> > Now my question: Is there some way to achieve this with a nicer
> > syntax?
>
> Depends on your tastes in syntax, e.g.:
>
> def test():
>     class Bunch: pass
>     loc = Bunch()
>     loc.count = 0
>     def inc_count():
>         loc.count += 1
>     inc_count()
>     inc_count()
>     print loc.count
>
> or:
>
> def test():
>     test.count = 0
>     def inc_count():
>         test.count += 1
>     inc_count()
>     inc_count()
>     print test.count
>
> and no doubt quite a few others.

I was trying to write a function that creates another function and returns it when I came across this problem. These two solutions have a difference:

def M():
    M.c = 0
    class Bunch: pass
    Bunch.c = 0
    def f():
        M.c += 1
        Bunch.c += 1
        print M.c, Bunch.c
    return f

>>> f = M()
>>> f2 = M()
>>> f()
1 1
>>> f()
2 2
>>> f()
3 3
>>> f2()
4 1
>>> f2()
5 2
>>> f2()
6 3

The created functions share the variables bound to the outer function object, but have their own separate copies of the variables bundled in a class. Is binding names to the function object a Python way to use 'static' variables? But how does one initialize them?

> > Using 'global' would not count, as that would make a variable
> > unnecessarily known to the outside.
>
> The two solutions above outlined differ in this respect -- variable
> 'loc' in the former is local to function test, while function attribute
> test.count in the latter is "known to the outside" (it survives each
> execution of test and can be examined "from the outside").
>
> Class Bunch (or some highly refined version thereof) IS typically
> around whenever I program, see for example:
>
> http://mail.python.org/pipermail/python-list/2002-July/112007.html
>
> for a typical presentation. Having available the MetaBunch there
> described, I'd start the function as follows:
>
> def test():
>     class Bunch(MetaBunch): count=0
>     loc = MetaBunch()
>     def inc_count():
>         loc.count += 1
>
> etc. I do find such constructs often handy...
>
> Alex
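For the record, Python 3 later added the nonlocal statement for exactly this rebinding case (a sketch; it does not exist in the Python 2 versions discussed in this thread):

    def test():
        count = 0
        def inc_count():
            nonlocal count   # rebind the enclosing function's variable
            count += 1
        inc_count()
        inc_count()
        print(count)         # prints 2 (Python 3 syntax)

    test()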
How can I create customized classes that have similar properties as 'str'?
I mean, all the class instances that equal each other should be reduced to only one instance, which means that for instances of this class there is no difference between a is b and a == b. Thank you.
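A minimal sketch of such an interning ("flyweight") class: __new__ keeps one canonical instance per value, so == collapses to is (the class and attribute names are made up, and instances must be treated as immutable for this to be safe):

    class Atom(object):
        _cache = {}
        def __new__(cls, value):
            # Return the one canonical instance for this value.
            try:
                return cls._cache[value]
            except KeyError:
                obj = object.__new__(cls)
                obj.value = value
                cls._cache[value] = obj
                return obj

    a = Atom(('fool', 'chicken'))
    b = Atom(('fool', 'chicken'))
    print a is b   # True: equal values share one instance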
Re: How can I create customized classes that have similar properties as 'str'?
I find myself frequently in need of classes like this, for two reasons. First, it's efficient in memory. Second, when two instances are compared for equality only their pointers are compared. (I think that's how Python compares 'str's.)

On Nov 24, 6:31 pm, Licheng Fang <[EMAIL PROTECTED]> wrote:
> I mean, all the class instances that equal each other should be
> reduced to only one instance, which means that for instances of this
> class there is no difference between a is b and a == b.
>
> Thank you.
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 7:05 pm, Bjoern Schliessmann wrote:
> Licheng Fang wrote:
> > I find myself frequently in need of classes like this for two
> > reasons. First, it's efficient in memory.
>
> Are you using millions of objects, or MB size objects? Otherwise,
> this is no argument.

Yes, millions. In my natural language processing tasks, I almost always need to define patterns, identify their occurrences in huge data, and count them. Say, I have a big text file, consisting of millions of words, and I want to count the frequency of trigrams:

trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

I can save the counts in a dict D1. Later, I may want to recount the trigrams, with some minor modifications, say, doing it on every other line of the input file, and save the counts in dict D2. The problem is, D1 and D2 have almost the same set of keys (trigrams of the text), yet the keys in D2 are new instances, even though these keys have probably already been inserted into D1. So I end up with unnecessary duplicates of keys. And this can be a great waste of memory with huge input data.

> BTW, what happens if you, by some operation, make a == b, and
> afterwards change b so another object instance must be created?
> This instance management is quite a runtime overhead.

I probably need this class to be immutable.

> > Second, when two instances are compared for equality only their
> > pointers are compared.
>
> I state that the object management will often eat more performance
> than equality testing. Except you have a huge number of equal
> objects. If the latter was the case you should rethink your program
> design.

Yeah, counting is all about equal or not.

> > (I think that's how Python compares 'str's.)
>
> Generally not. In CPython, just very short strings are created only
> once.
>
> >>> a = " "
> >>> b = " "
> >>> a is b
> True
> >>> a = "  "
> >>> b = "  "
> >>> a is b
> False

Wow, I didn't know this. But exactly how does Python manage these strings? My interpreter gave me these results:

>>> a = 'this'
>>> b = 'this'
>>> a is b
True
>>> a = 'this is confusing'
>>> b = 'this is confusing'
>>> a is b
False

> Regards,
>
> Björn
>
> --
> BOFH excuse #430:
>
> Mouse has out-of-cheese-error
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 24, 9:42 pm, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> On Sat, 24 Nov 2007 13:40:40 +0100, Bjoern Schliessmann wrote:
> > Licheng Fang wrote:
> >> Wow, I didn't know this. But exactly how does Python manage these
> >> strings?
> >
> > I don't know (use the source, Luke). :) Or perhaps there is a Python
> > Elder here that knows?
>
> AFAIK strings of length 1 and strings that would be valid Python
> identifiers are treated this way.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Thanks. Then, is there a way to make Python treat all strings this way, or any other kind of immutable objects?
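For strings specifically, CPython exposes this via the built-in intern() (moved to sys.intern() in Python 3); for other immutable objects, a cache keyed by value, like the __new__ sketch earlier in this thread, does the same job:

    a = intern('this is confusing')
    b = intern('this is confusing')
    print a is b   # True: both names refer to the single interned copy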
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 25, 5:59 am, Steven D'Aprano <[EMAIL PROTECTED] cybersource.com.au> wrote:
> On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
> > On Nov 24, 7:05 pm, Bjoern Schliessmann <[EMAIL PROTECTED]> wrote:
> >> Licheng Fang wrote:
> >> > I find myself frequently in need of classes like this for two
> >> > reasons. First, it's efficient in memory.
> >>
> >> Are you using millions of objects, or MB size objects? Otherwise,
> >> this is no argument.
> >
> > Yes, millions.
>
> Oh noes!!! Not millions of words! That's like, oh, a few tens of
> megabytes! How will a PC with one or two gigabytes of RAM cope?
>
> Tens of megabytes is not a lot of data.
>
> If the average word size is ten characters, then one million words
> takes ten million bytes, or a little shy of ten megabytes. Even if you
> are using four-byte characters, you've got 40 MB, still a moderate
> amount of data on a modern system.

I mentioned trigram counting as an illustrative case. In fact, you'll often need to define patterns more complex than that, and tens of megabytes of text may generate millions of them; I've watched them quickly eat up the 8G memory of a workstation in a few minutes. Manipulating these patterns can be tricky, and you can easily waste a lot of memory without taking extra care. I just thought that if I defined my pattern class with this 'atom' property, coding efforts could be easier later.

> > In my natural language processing tasks, I almost always
> > need to define patterns, identify their occurrences in huge data, and
> > count them. [snip]
>
> All these keys will almost certainly add up to only a few hundred
> megabytes, which is a reasonable size of data but not excessive. This
> really sounds to me like a case of premature optimization. I think you
> are wasting your time solving a non-problem.
>
> [snip]
>
> > Wow, I didn't know this. But exactly how does Python manage these
> > strings? My interpreter gave me these results:
> >
> > >>> a = 'this'
> > >>> b = 'this'
> > >>> a is b
> > True
> > >>> a = 'this is confusing'
> > >>> b = 'this is confusing'
> > >>> a is b
> > False
>
> It's an implementation detail. You shouldn't use identity testing
> unless you actually care that two names refer to the same object, not
> because you want to save a few bytes. That's poor design: it's fragile,
> complicated, and defeats the purpose of using a high-level language
> like Python.
>
> --
> Steven.
Re: How can I create customized classes that have similar properties as 'str'?
On Nov 27, 10:45 am, Steven D'Aprano <[EMAIL PROTECTED]> wrote:
> On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:
> > I mentioned trigram counting as an illustrative case. In fact, you'll
> > often need to define patterns more complex than that, and tens of
> > megabytes of text may generate millions of them; I've watched them
> > quickly eat up the 8G memory of a workstation in a few minutes.
> > Manipulating these patterns can be tricky, and you can easily waste a
> > lot of memory without taking extra care. I just thought that if I
> > defined my pattern class with this 'atom' property, coding efforts
> > could be easier later.
>
> I'm just not getting the same results as you when I try this. I'm
> finding that with no extra effort at all, it just works.
>
> The size of your corpus is not important. Neither is the complexity of
> how you generate the patterns. What's important is the number of
> patterns you produce, and "millions" isn't that huge a number, even
> without interning the strings.
>
> Obviously I'm not running your code, but I can build a dict with
> millions of patterns, from hundreds of megabytes of text, on a PC with
> just 1GB of memory and not run into any serious problems.
>
> I've just processed roughly 240MB of random emails, generating n-grams
> up to length 5. The emails include binary attachments and HTML etc., so
> I'm getting lots of patterns that don't normally exist in natural
> languages (e.g. 71 occurrences of 'qqq', and 5 of ''). As I said, my PC
> has only 1GB, and that's being shared with about a dozen other apps
> (including such memory hogs as Firefox).
>
> Results? 64939962 patterns found, of which 17031467 are unique. There's
> paging, yes, and my PC runs a little slow when I try to switch from one
> running application to another, but nothing unusable. Opening a dozen
> YouTube videos at once impacts performance worse.
>
> I can't think what you're doing to use up 8GB of RAM for merely
> "millions" of strings, even if you are keeping two, three, ten
> redundant copies. Assuming an average length of twenty bytes per
> pattern (you said trigrams, but I'd rather over-estimate than under),
> and even assuming that only half the 8GB are available to Python, you
> should be able to store something of the order of one hundred million
> patterns easily:

My task is identifying sequences of tokens (phrases) that are possible translations of each other in a bilingual corpus. I need to check all the subsequences of a sentence to get the possible phrase pairs. This makes the problem different from n-gram counting in that the number of possible phrases doesn't grow linearly with n, but approximately with n^2 (n being the sentence length).

My first version of the program consumed almost twice as much memory as the current one, because I discovered that in computing different statistics I was regenerating the patterns, and the counting dictionaries ended up with duplicated pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I could restrict the pattern class to not generate identical instances? Then I could avoid such subtle but significant bugs.

> 4 bytes for a pointer plus 20 bytes for the string = 24 bytes
>
> 4*1024**3 / 24 = 178,956,970
>
> (This is a ballpark figure. Real Python strings will have additional
> overhead.)
>
> If you're running into problems with merely millions of patterns, then
> you're doing worse by probably two orders of magnitude.
>
> I don't think that the problem lies where you think it does. If you
> have a dict with millions of keys, it doesn't matter how many times
> each pattern exists in the corpus because the key only exists *once* in
> the dict. Duplicate the dict, or generate it from scratch even, and at
> worse you double the memory used by the keys. And probably not even
> that.
>
> The only thing I can think of that might explain why you're using so
> much memory is if you are generating *all* the patterns up front, say
> in a list, before adding them to the dict:
>
> # Generate one massive list of patterns containing many duplicates
> patterns = make_patterns(text)
> # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
> d = {}
> for pattern in patterns:
>     d[pattern] = d.get(pattern, 0) + 1

No, I wasn't doing that. BTW, do you think the pattern counting code can avoid hashing the pattern twice? Is there a way to do that when the dictionary values are of a primitive type?

> Notice that the real killer in the above algorithm is that you need
> enough storage, not just for the
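On the double-hashing question raised above: d[pattern] = d.get(pattern, 0) + 1 looks the key up twice at the Python level. collections.defaultdict (Python 2.5+) doesn't remove the second lookup, but it moves the default handling into C and tidies the code; note also that CPython str objects cache their hash value, so repeatedly hashing the same string is cheap. A sketch:

    from collections import defaultdict

    counts = defaultdict(int)           # missing keys start at 0
    for pattern in ['fre', 'req', 'fre']:
        counts[pattern] += 1            # still a lookup plus a store,
                                        # but no Python-level .get() call
    print dict(counts)                  # counts: 'fre' -> 2, 'req' -> 1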
Nested Lists Assignment Problem
I want to use nested lists as an array, but here's the problem:

>>> a = [[0]*3]*3
>>> a
[[0, 0, 0], [0, 0, 0], [0, 0, 0]]
>>> a[0][0] = 1
>>> a
[[1, 0, 0], [1, 0, 0], [1, 0, 0]]

Could anybody please explain to me why three values were changed? I'm bewildered. Thanks!
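The usual fix, as a sketch: build each inner row separately, so the rows are distinct lists rather than three references to one list:

    a = [[0] * 3 for i in range(3)]   # three independent rows
    a[0][0] = 1
    print a   # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]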
Re: Nested Lists Assignment Problem
Dennis Lee Bieber wrote:
> On 26 Apr 2006 01:13:20 -0700, "Licheng Fang" <[EMAIL PROTECTED]>
> declaimed the following in comp.lang.python:
>
> > Could anybody please explain to me why three values were changed? I'm
> > bewildered. Thanks!
>
> http://www.aifb.uni-karlsruhe.de/Lehrangebot/Winter2000-01/E-Shop/Python/Doku/The%20Whole%20Python%20FAQ.htm#4.50

Thank you very much! But I still wonder why a nested assignment "a = [[0]*3]*3" generates 3 references to the same list, while the commands below apparently do not.

>>> a = [0] * 2
>>> a
[0, 0]
>>> a[0] = 1
>>> a
[1, 0]
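In short, both forms copy references, but the two assignments do different things: a[0] = 1 rebinds a slot of the outer list, while a[0][0] = 1 mutates the one shared inner list. A small demonstration:

    a = [[0] * 3] * 3
    print a[0] is a[1]    # True: three references to one inner list
    a[0][0] = 1           # mutates that shared list, so every row shows it

    b = [0] * 2           # two references to the same int object 0
    b[0] = 1              # rebinds slot 0 to a new object; b[1] unaffected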