Fredrik Lundh wrote:
> Ron Adam wrote:
>
>
>>The \w does make a small difference, but not as much as I expected.
>
>
> that's probably because your benchmark has a lot of dubious overhead:
I think it does what the OP described, but that may not be what he
really needs.
Although the test to
Ron Adam wrote:
> The \w does make a small difference, but not as much as I expected.
that's probably because your benchmark has a lot of dubious overhead:
> word_finder = re.compile('[a-z0-9_@]+', re.I)
no need to force case-insensitive search here; \w looks for both lower-
and uppercase
Fredrik Lundh wrote:
> Lonnie Princehouse wrote:
>
>
>>"[a-z0-9_]" means "match a single character from the set {a through z,
>>0 through 9, underscore}".
>
>
> "\w" should be a bit faster; it's equivalent to "[a-zA-Z0-9_]" (unless you
> specify otherwise using the locale or unicode flags), but is handled more
> efficiently by the RE engine.
Bengt Richter enlightened us with:
> I meant somestring.split() just like that -- without a splitter
> argument. My suspicion remains ;-)
Mine too ;-)
Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
On Sat, 12 Nov 2005 10:46:53 +0100, Sybren Stuvel <[EMAIL PROTECTED]> wrote:
>Bengt Richter enlightened us with:
>> I suspect it's not possible to get '' in the list from
>> somestring.split()
>
>Time to adjust your suspicions:
>
>>>> ';abc;'.split(';')
>['', 'abc', '']
I know about that one ;-)
I
Thank you Bengt Richter and Sybren Stuvel for your comments; my little
procedure can be improved in many ways, it was just a quickly written
first version (but it can be enough for basic usage).
Bengt Richter:
>good way to prepare for split
Maybe there is a better way, that is putting in
Bengt Richter enlightened us with:
> I suspect it's not possible to get '' in the list from
> somestring.split()
Time to adjust your suspicions:
>>> ';abc;'.split(';')
['', 'abc', '']
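A quick illustrative check of the difference between the two forms of split (example strings are made up):

```python
# split with an explicit separator keeps empty strings at the
# boundaries; the no-argument form never produces them
with_sep = ';abc;'.split(';')
no_arg = '  abc  def  '.split()

print(with_sep)  # ['', 'abc', '']
print(no_arg)    # ['abc', 'def']
```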
>>    countDict[w] += 1
>>else:
>>    countDict[w] = 1
> does t
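The quoted if/else counting pattern can also be written without the branch; a sketch using collections.defaultdict (the countDict name is kept from the thread, the sample words are invented):

```python
from collections import defaultdict

# defaultdict(int) supplies the missing-key 0, replacing the
# explicit if/else branch in the quoted snippet
countDict = defaultdict(int)
for w in "a b a c a".split():
    countDict[w] += 1

print(dict(countDict))  # {'a': 3, 'b': 1, 'c': 1}
```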
On 10 Nov 2005 10:43:04 -0800, [EMAIL PROTECTED] wrote:
>This can be faster; it avoids doing the same things multiple times:
>
>from string import maketrans, ascii_lowercase, ascii_uppercase
>
>def create_words(afile):
>    stripper = """'[",;<>{}_&?!():[]\.=+-*\t\n\r^%0123456789/"""
>    mapper = maketrans(stripper + ascii_uppercase,
>                       " "*len(stripper) + ascii_lowercase)
<[EMAIL PROTECTED]> wrote:
>Oh sorry, indentation was messed up here... the
>wordlist = countDict.keys()
>wordlist.sort()
>should be outside the word loop now
>def create_words(lines):
>    cnt = 0
>    spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
>    for content in lines:
>        words = content.split()
Lonnie Princehouse wrote:
> "[a-z0-9_]" means "match a single character from the set {a through z,
> 0 through 9, underscore}".
"\w" should be a bit faster; it's equivalent to "[a-zA-Z0-9_]" (unless you
specify otherwise using the locale or unicode flags), but is handled more
efficiently by the RE engine.
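A small check of that equivalence (note: in Python 3, \w defaults to Unicode, so re.ASCII is needed to get exactly the [a-zA-Z0-9_] behaviour described; the sample text is invented):

```python
import re

text = "Foo_bar42 baz-99 Qux"

# re.ASCII restricts \w to [a-zA-Z0-9_], matching the behaviour
# described above (without it, Python 3 \w also matches accented
# letters and other Unicode word characters)
w_finder = re.compile(r'\w+', re.ASCII)
cls_finder = re.compile(r'[a-zA-Z0-9_]+')

assert w_finder.findall(text) == cls_finder.findall(text)
print(w_finder.findall(text))  # ['Foo_bar42', 'baz', '99', 'Qux']
```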
The word_finder regular expression defines what will be considered a
word.
"[a-z0-9_]" means "match a single character from the set {a through z,
0 through 9, underscore}".
The + means "match as many as you can, minimum of one"
To match @ as well, add it to the set of characters to match:
word_finder = re.compile('[a-z0-9_@]+', re.I)
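For example, with @ added to the character class, address-like tokens survive as single words (the sample text here is invented):

```python
import re

# same class as above, with @ added so it counts as a word character
word_finder = re.compile(r'[a-z0-9_@]+', re.I)

text = "mail user@example_host about the BUG"
print(word_finder.findall(text))
# ['mail', 'user@example_host', 'about', 'the', 'BUG']
```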
ok this sounds much better.. could you tell me what to do if I want to
leave characters like @ in words? So I would like to consider this as
part of the word
--
http://mail.python.org/mailman/listinfo/python-list
Actually I create a separate wordlist for each so-called line. Here by
line I mean what would be a paragraph in future... so I will have to
recreate the wordlist on each loop
[EMAIL PROTECTED] wrote:
> I wrote this function which does the following:
> after reading lines from a file, it splits and finds the word occurrences
> through a hash table... for some reason this is quite slow... can someone
> help me make it faster...
> f = open(filename)
> lines = f.readlines()
>
This can be faster; it avoids doing the same things multiple times:
from string import maketrans, ascii_lowercase, ascii_uppercase
def create_words(afile):
    stripper = """'[",;<>{}_&?!():[]\.=+-*\t\n\r^%0123456789/"""
    mapper = maketrans(stripper + ascii_uppercase,
                       " "*len(stripper) + ascii_lowercase)
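A runnable Python 3 sketch of this translate-based approach (str.maketrans replaces the old string.maketrans; the count_words name and the sample input are my own):

```python
from string import ascii_lowercase, ascii_uppercase

def count_words(lines):
    # one table maps punctuation/digits to spaces and uppercase to
    # lowercase, so each line needs a single translate() pass
    stripper = '\'",;<>{}_&?!():[].=+-*\t\n\r^%0123456789/'
    table = str.maketrans(stripper + ascii_uppercase,
                          ' ' * len(stripper) + ascii_lowercase)
    counts = {}
    for line in lines:
        for w in line.translate(table).split():
            counts[w] = counts.get(w, 0) + 1
    return counts

print(count_words(["Hello, hello world!", "World: 42 times."]))
# {'hello': 2, 'world': 2, 'times': 1}
```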
You're making a new countDict for each line read from the file... is
that what you meant to do? Or are you trying to count word occurrences
across the whole file?
--
In general, any time string manipulation is going slowly, ask yourself,
"Can I use the re module for this?"
# disclaimer: untested
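A hedged reconstruction of what such an re-based counter might look like (the original was posted untested and truncated; the word_finder pattern is taken from earlier in the thread, count_words is my own name):

```python
import re
from collections import Counter

word_finder = re.compile(r'[a-z0-9_]+', re.I)

def count_words(text):
    # findall pulls every word out of the text in one pass inside
    # the RE engine, and Counter tallies the occurrences
    return Counter(word_finder.findall(text.lower()))

print(count_words("Spam spam EGGS and spam"))
# Counter({'spam': 3, 'eggs': 1, 'and': 1})
```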
don't know your intent so have no idea what it is for. However, you are
doing:
wordlist = countDict.keys()
wordlist.sort()
for every word processed, yet you don't use the content of x in any way
during the loop. Even if you need one fresh snapshot of countDict after
each word, I don't see the need for doing it inside the loop.
Oh sorry, indentation was messed up here... the
wordlist = countDict.keys()
wordlist.sort()
should be outside the word loop now
def create_words(lines):
    cnt = 0
    spl_set = '[",;<>{}_&?!():-[\.=+*\t\n\r]+'
    for content in lines:
        words = content.split()
        countDict = {}
        wor
why rebuild wordlist and sort it after processing each word? It seems
that it can be done after the for loop.
[EMAIL PROTECTED] wrote:
> I wrote this function which does the following:
> after reading lines from a file, it splits and finds the word occurrences
> through a hash table... for some reason
I wrote this function which does the following:
after reading lines from a file, it splits and finds the word occurrences
through a hash table... for some reason this is quite slow... can someone
help me make it faster...
f = open(filename)
lines = f.readlines()
def create_words(lines):
    cnt = 0
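Pulling the thread's advice together, one possible corrected version of the function (names kept from the original post; the per-word sort is hoisted out of the loop, and the return values are my assumption since the original ends here):

```python
def create_words(lines):
    countDict = {}
    for content in lines:
        for w in content.split():
            countDict[w] = countDict.get(w, 0) + 1
    # sort once, after all the words are counted, not per word
    wordlist = sorted(countDict)
    return wordlist, countDict

wordlist, counts = create_words(["b a", "a c"])
print(wordlist)  # ['a', 'b', 'c']
print(counts)    # {'b': 1, 'a': 2, 'c': 1}
```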