On Fri, Nov 21, 2008 at 9:12 PM, scsoce <[EMAIL PROTECTED]> wrote:
> MRAB wrote:
>
>> Steve Holden wrote:
>>
>>> Please keep this on the list.
>>>
>>> scsoce wrote:
>>>
>>>> Steve Holden wrote:
>>>>
>>>>> scsoce wrote:
>>>>>
>>>>>> say, when I try to search and match every char from variable length
>>>>>> string, such as string '123456', i tried re.findall( r'(\d)*, '12346' )
>>>>>>
>>>>> I think you will find you missed a quote out there. Always better to
>>>>> copy and paste ...
>>>>>
>>>>>> , but only get '6' and Python doc indeed say: "If a group is contained
>>>>>> in a part of the pattern that matched multiple times, the last match is
>>>>>> returned."
>>>>>>
>>>>> So use
>>>>>
>>>>> r'(\d*)'
>>>>>
>>>>> instead and then the group includes all the digits you match.
>>>>>
>>>>>> cause the regx engine cannot remember all the past history then ? is it
>>>>>> nature to all regx engine or only to Python ?
>>>>>>
>>>>> Different regex engines have different capabilities, so I can't speak to
>>>>> them all. If you wanted *all* the matches of *all* groups, how would you
>>>>> have them returned? As a list? That would make the case where there was
>>>>> only one match much trickier to handle. And what would you do with
>>>>>
>>>>> r'((\w)*\d)*)'
>>>>>
>>>>> Also, what about named groups? I can see enough potential implementation
>>>>> issues that I can perfectly understand why Python works the way it does,
>>>>> so I'd be interested to know why it doesn't makes sense to you, and what
>>>>> you would prefer it to do.
>>>>>
>>>>> regards
>>>>> Steve
>>>>>
>>>> maybe my expression was not clear. I want to capture every matched part
>>>> in a repeated pattern, not only the last, say, for string '123456', I
>>>> want to back reference any one char, not only the '6'.
>>>> and i know the example is very simple, so we can got the whole string
>>>> using regx and get every char using other python statements, but if the
>>>> pattern in group is complex?
>>>> and I test in VIM, it can do the 'back reference':
>>>> ==you text in vim:
>>>> 123456
>>>> == pattern:
>>>> :%s/\(\d\)*/$2
>>>> text will turn to be:
>>>> 2
>>>>
>>>> 'Fraid the Python re implementers just decided not to do it that way.
>>>
>>> Nor Perl.
>>
>> Probably what you want is re.findall(r"(\d)", "123456"), which returns a
>> list of what it captured.
>>
> Yes, you are right, but this way findall() capture only the 'top' group.
> What I really need to do is to capture nested and repeated patterns, say,
> <table> tag in html contains many <tr>, <tr> contains many <td>, the
> data in <td> is i need, so I write the regx like this:
> regx ='''
> <table.*\n
> (
> (\s*<tr.*\n
> (\s*<td.*</td>\n|\n)*
> \s*</tr>\n
> |\n)*
> )
> \s*</table>
> '''
> Steve Holden wrote:
>
>> I can see enough potential implementation
>> issues that I can perfectly understand why Python works the way it does,
>> so I'd be interested to know why it doesn't makes sense to you, and what
>> you would prefer it to do.
>>
>
> As Steve said, if re really cannot do this kind of work , so I have to
> split the one line regx down, and capture <table> first, and then loop to
> capture <tr>, and then <td>, and so on ... . I do not like this way compared
> with the above one clean regx line.
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
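For reference, the capture behaviour discussed in the quoted thread can be reproduced in a few lines. This is only a sketch: the variable names are mine, and it uses re.match on '123456' rather than the exact call from the original message.

```python
import re

text = "123456"

# A group inside a repetition keeps only its last iteration,
# as the re documentation quoted above says.
last = re.match(r"(\d)*", text).group(1)

# Moving the repetition inside the group captures the whole run.
whole = re.match(r"(\d*)", text).group(1)

# findall with a single, non-repeated group returns one entry per match.
each = re.findall(r"(\d)", text)

print(last, whole, each)
```

Here `last` is '6', `whole` is '123456', and `each` is the list of the six individual digits.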
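If you're parsing structured markup like HTML, why not use a tool meant for
that? I personally find BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/) to be very good at this. For
instance, here's a code snippet I recently used to pull specific data out of
a table on a site:

    # BeautifulSoup 3 import; newer releases use "from bs4 import BeautifulSoup".
    from BeautifulSoup import BeautifulSoup

    # some_page holds the HTML source of the page.
    soup = BeautifulSoup(some_page)

    # Collect the stripped text of every <font> tag found in the <td> cells
    # of each <tr class="targetClass"> row.
    opts = [fonttag.string.strip()
            for row in soup('tr', attrs={'class': 'targetClass'})
            for cell in row('td')
            for fonttag in cell('font')]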
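If BeautifulSoup isn't available, the same nested <tr>/<td> walk can be sketched with the standard library's html.parser. This is a swapped-in technique, not the snippet I actually used; the TableCells class and the sample markup are made up for illustration, and the sketch assumes there are no nested tables.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of each <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self._in_td = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only text inside a <td> is kept; whitespace between tags is ignored.
        if self._in_td:
            self.rows[-1][-1] += data


# Hypothetical sample markup, standing in for a real page.
sample = """
<table>
  <tr><td>a1</td><td>a2</td></tr>
  <tr><td>b1</td><td>b2</td></tr>
</table>
"""

parser = TableCells()
parser.feed(sample)
parser.close()

rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)
```

Every row keeps all of its cells, which is the "capture every repetition" result that a single repeated regex group cannot give you.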