Sania writes:

> On Apr 19, 2:48 am, Jussi Piitulainen <jpiit...@ling.helsinki.fi>
> wrote:
> > Sania writes:
> > > So I am trying to get the number of casualties in a text. After 'death
> > > toll' in the text the number I need is presented as you can see from
> > > the variable called text. Here is my code
> > > I'm pretty sure my regex is correct, I think it's the group part
> > > that's the problem.
> > > I am using nltk by python. Group grabs the string in parenthesis and
> > > stores it in deadnum and I make deadnum into a list.
> > >
> > > text="accounts put the death toll at 637 and those missing at
> > > 653 , but the total number is likely to be much bigger"
> > > dead=re.match(r".*death toll.*(\d[,\d\.]*)", text)
> > > deadnum=dead.group(1)
> > > deaths.append(deadnum)
> > > print deaths
> >
> > It's the regexp. The .* after "death toll" eats the input as far as it
> > can without making the whole match fail. The group matches only the
> > last digit in the text.
> >
> > You could allow only non-digits before the number. Or you could look
> > up the variant of * that only matches as much as it must.
>
> Hey Thanks,
> So now my regex is
>
> dead=re.match(r".*death toll.{0,20}(\d[,\d\.]*)", text)
>
> But I only find 7 not 637. How is it that the group is only matching
> the last digit? The whole thing is in parentheses, not just the last
> part.

It's still consuming the digits in the text that comes _before_ the
parenthesised group: the .{0,20} matches as _much_ as it _can_ without
making the whole regex fail, and the . in it also matches digits.

Try \D{0,20} to limit its matching ability to non-digits. Try .{0,20}?
to limit it to matching as _little_ as it can. (The variant of * I
referred to is *?; {} and {}? are similar.)

The simplicity of regexen is deceptive. Be careful. Be surprised.
See <http://docs.python.org/library/re.html>. Keep them simple. Consider
also other means instead or in addition.
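
To make the difference concrete, here is a small sketch that runs the four variants discussed in this thread against the sample text. The names NUMBER, patterns and results are only illustrative and not from the original code, and it is written for Python 3 (print as a function) rather than the Python 2 of the original post.

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# The capturing group from the original post, reused unchanged.
NUMBER = r"(\d[,\d\.]*)"

# Hypothetical labels; each pattern differs only in what sits
# between "death toll" and the capturing group.
patterns = {
    "greedy_star": r".*death toll.*" + NUMBER,          # original attempt
    "greedy_bounded": r".*death toll.{0,20}" + NUMBER,  # second attempt
    "non_digits": r".*death toll\D{0,20}" + NUMBER,     # only non-digits
    "non_greedy": r".*death toll.{0,20}?" + NUMBER,     # as little as possible
}

results = {}
for label, pattern in patterns.items():
    match = re.match(pattern, text)
    # group(1) is the parenthesised part; None if the regex did not match.
    results[label] = match.group(1) if match else None
    print(label, "->", results[label])

# greedy_star -> 3       (last digit in the whole text)
# greedy_bounded -> 7    (the filler swallowed " at 63")
# non_digits -> 637
# non_greedy -> 637
```

In both failing cases the group itself works as intended; it is simply handed only the one digit that the greedy part before it had to give back.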