Re: help make it faster please

2005-11-13 Thread Ron Adam
Fredrik Lundh wrote: > Ron Adam wrote: > > >>The \w does make a small difference, but not as much as I expected. > > > that's probably because your benchmark has a lot of dubious overhead: I think it does what the OP described, but that may not be what he really needs. Although the test to

Re: help make it faster please

2005-11-13 Thread Fredrik Lundh
Ron Adam wrote: > The \w does make a small difference, but not as much as I expected. that's probably because your benchmark has a lot of dubious overhead: > word_finder = re.compile('[EMAIL PROTECTED]', re.I) no need to force case-insensitive search here; \w looks for both lower- and uppercase

Re: help make it faster please

2005-11-13 Thread Ron Adam
Fredrik Lundh wrote: > Lonnie Princehouse wrote: > > >>"[a-z0-9_]" means "match a single character from the set {a through z, >>0 through 9, underscore}". > > > "\w" should be a bit faster; it's equivalent to "[a-zA-Z0-9_]" (unless you > specify otherwise using the locale or unicode flags), b

Re: help make it faster please

2005-11-13 Thread Sybren Stuvel
Bengt Richter enlightened us with: > I meant somestring.split() just like that -- without a splitter > argument. My suspicion remains ;-) Mine too ;-) Sybren

Re: help make it faster please

2005-11-12 Thread Bengt Richter
On Sat, 12 Nov 2005 10:46:53 +0100, Sybren Stuvel <[EMAIL PROTECTED]> wrote: >Bengt Richter enlightened us with: >> I suspect it's not possible to get '' in the list from >> somestring.split() > >Time to adjust your suspicions: > ';abc;'.split(';') >['', 'abc', ''] I know about that one ;-) I

Re: help make it faster please

2005-11-12 Thread bearophileHUGS
Thank you Bengt Richter and Sybren Stuvel for your comments. My little procedure can be improved in many ways; it was just a quickly written first version (but it may be enough for basic usage). Bengt Richter: >good way to prepare for split Maybe there is a better way, that is putting in

Re: help make it faster please

2005-11-12 Thread Sybren Stuvel
Bengt Richter enlightened us with: > I suspect it's not possible to get '' in the list from > somestring.split() Time to adjust your suspicions: >>> ';abc;'.split(';') ['', 'abc', ''] >>countDict[w] += 1 >>else: >>countDict[w] = 1 > does t
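The two claims in this exchange are about different calls, and both hold. A minimal sketch in Python 3 (the sample strings other than ';abc;' are made up for illustration):

```python
# Splitting on an explicit separator keeps empty fields at the edges.
with_separator = ';abc;'.split(';')

# Splitting with no argument splits on runs of whitespace and ignores
# leading/trailing whitespace, so no empty strings are produced.
without_separator = '  abc   def  '.split()
empty_input = ''.split()

print(with_separator)     # ['', 'abc', '']
print(without_separator)  # ['abc', 'def']
print(empty_input)        # []
```

So a check for '' in the result is only needed when a separator argument is passed.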

Re: help make it faster please

2005-11-11 Thread Bengt Richter
On 10 Nov 2005 10:43:04 -0800, [EMAIL PROTECTED] wrote: >This can be faster, it avoids doing the same things more times: > >from string import maketrans, ascii_lowercase, ascii_uppercase > >def create_words(afile): >stripper = """'[",;<>{}_&?!():[]\.=+-*\t\n\r^%0123456789/""" >mapper = mak

Re: help make it faster please

2005-11-11 Thread Sion Arrowsmith
<[EMAIL PROTECTED]> wrote: >Oh sorry indentation was messed here...the >wordlist = countDict.keys() >wordlist.sort() >should be outside the word loop now >def create_words(lines): >cnt = 0 >spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+' >for content in lines: >words=content.spl

Re: help make it faster please

2005-11-10 Thread Fredrik Lundh
Lonnie Princehouse wrote: > "[a-z0-9_]" means "match a single character from the set {a through z, > 0 through 9, underscore}". "\w" should be a bit faster; it's equivalent to "[a-zA-Z0-9_]" (unless you specify otherwise using the locale or unicode flags), but is handled more efficiently by the R
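The equivalence described above can be checked directly. This is a sketch in Python 3, where \w on str patterns also matches non-ASCII word characters by default, so re.ASCII is passed to get the plain "[a-zA-Z0-9_]" meaning; the sample text is an assumption. It checks only that the two patterns match the same things, not their relative speed.

```python
import re

text = "Fast_Count 42 words, e-mail: Foo_bar99"

# re.ASCII restricts \w to [a-zA-Z0-9_] for str patterns in Python 3.
shorthand = re.findall(r"\w+", text, re.ASCII)
explicit = re.findall(r"[a-zA-Z0-9_]+", text)

print(shorthand == explicit)  # True
print(shorthand)  # ['Fast_Count', '42', 'words', 'e', 'mail', 'Foo_bar99']
```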

Re: help make it faster please

2005-11-10 Thread Lonnie Princehouse
The word_finder regular expression defines what will be considered a word. "[a-z0-9_]" means "match a single character from the set {a through z, 0 through 9, underscore}". The + means "match as many as you can, minimum of one" To match @ as well, add it to the set of characters to match: wor
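As a rough illustration of adding @ to the set: the exact pattern from the post is cut off above, so the pattern and sample text below are assumptions, not the poster's code.

```python
import re

# Hypothetical pattern: the character set from the post with '@' added.
# re.I makes the a-z range match uppercase letters as well.
word_finder = re.compile(r"[a-z0-9_@]+", re.I)

sample = "Mail bob@example.com, then ping @alice"
found = word_finder.findall(sample)
print(found)
# ['Mail', 'bob@example', 'com', 'then', 'ping', '@alice']
```

Note that '.' is still not in the set, so an address is split at each dot.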

Re: help make it faster please

2005-11-10 Thread pkilambi
OK, this sounds much better. Could you tell me what to do if I want to keep characters like @ in words? I would like to treat @ as part of a word. -- http://mail.python.org/mailman/listinfo/python-list

Re: help make it faster please

2005-11-10 Thread pkilambi
Actually I create a separate wordlist for each so-called line. By "line" I mean what will be a paragraph in future, so I will have to recreate the wordlist on each loop iteration.

Re: help make it faster please

2005-11-10 Thread Larry Bates
[EMAIL PROTECTED] wrote: > I wrote this function which does the following: > after reading lines from file. It splits and finds the word occurrences > through a hash table... for some reason this is quite slow. Can someone > help me make it faster... > f = open(filename) > lines = f.readlines() >

Re: help make it faster please

2005-11-10 Thread bearophileHUGS
This can be faster, it avoids doing the same things more times: from string import maketrans, ascii_lowercase, ascii_uppercase def create_words(afile): stripper = """'[",;<>{}_&?!():[]\.=+-*\t\n\r^%0123456789/""" mapper = maketrans(stripper + ascii_uppercase, " "*le
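The idea of building one translation table up front and applying it in a single pass might look like this in Python 3, where str.maketrans replaces the string.maketrans import used in the post. The set of stripped characters and the function name are assumptions; the post's own code is cut off above.

```python
from string import ascii_lowercase, ascii_uppercase, digits, punctuation

# Assumption: these are the characters to discard. The list in the post
# differs and is truncated, so this set is illustrative only.
STRIP_CHARS = punctuation + digits + "\t\n\r"

# Build the table once: discarded characters become spaces and
# uppercase letters become lowercase.
TABLE = str.maketrans(
    STRIP_CHARS + ascii_uppercase,
    " " * len(STRIP_CHARS) + ascii_lowercase,
)

def split_words(text):
    """Return the lowercase words in text using a single translate pass."""
    return text.translate(TABLE).split()

print(split_words("Hello, World! 42 hello"))  # ['hello', 'world', 'hello']
```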

Re: help make it faster please

2005-11-10 Thread Lonnie Princehouse
You're making a new countDict for each line read from the file... is that what you meant to do? Or are you trying to count word occurrences across the whole file? -- In general, any time string manipulation is going slowly, ask yourself, "Can I use the re module for this?" # disclaimer: unteste
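A minimal sketch of counting words across a whole text with the re module follows. It is not the code from the post, which is cut off above; the pattern, the lowercasing, and the names are assumptions.

```python
import re

# Hypothetical definition of a word: a run of letters, digits or underscores.
WORD_RE = re.compile(r"\w+")

def count_words(text):
    """Return a dict mapping each lowercase word to its number of occurrences."""
    counts = {}
    for word in WORD_RE.findall(text.lower()):
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_words("The cat saw the other cat.\nThe end.")
print(counts)  # {'the': 3, 'cat': 2, 'saw': 1, 'other': 1, 'end': 1}
```

A single dict is built for the whole text here; counting per line or per paragraph would mean calling the function once per chunk.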

Re: help make it faster please

2005-11-10 Thread [EMAIL PROTECTED]
don't know your intent so have no idea what it is for. However, you are doing : wordlist=countDict.keys() wordlist.sort() for every word processed yet you don't use the content of x in any way during the loop. Even if you need one fresh snapshot of countDict after each word, I don't see the need fo
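The point about the sort can be sketched as follows: count inside the loop, then build the sorted word list once afterwards. The function name and sample input are made up for illustration.

```python
def count_and_sort(words):
    """Count words, then build the sorted word list once at the end."""
    count_dict = {}
    for w in words:
        if w in count_dict:
            count_dict[w] += 1
        else:
            count_dict[w] = 1
    # Sorting here runs once, instead of once per word inside the loop.
    wordlist = sorted(count_dict)
    return count_dict, wordlist

count_dict, wordlist = count_and_sort("b a c a b a".split())
print(wordlist)    # ['a', 'b', 'c']
print(count_dict)  # {'b': 2, 'a': 3, 'c': 1}
```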

Re: help make it faster please

2005-11-10 Thread pkilambi
Oh sorry, the indentation was messed up here... the wordlist = countDict.keys() wordlist.sort() should be outside the word loop now def create_words(lines): cnt = 0 spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+' for content in lines: words=content.split() countDict={} wor

Re: help make it faster please

2005-11-10 Thread [EMAIL PROTECTED]
Why reload wordlist and sort it after processing each word? It seems this can be done after the for loop. [EMAIL PROTECTED] wrote: > I wrote this function which does the following: > after reading lines from file. It splits and finds the word occurrences > through a hash table... for some reason

help make it faster please

2005-11-10 Thread pkilambi
I wrote this function which does the following: after reading lines from a file, it splits them and finds the word occurrences through a hash table. For some reason this is quite slow. Can someone help me make it faster? f = open(filename) lines = f.readlines() def create_words(lines): cnt = 0