Thanks a lot for the quick reply, Chris! I did try BeautifulSoup and it's great, but I'm not very familiar with it, and it takes more code to parse the HTML and get the right text. I think a regexp would be more convenient if there is a way to pull out the whole list in just one line :) I went the regexp way all along but got stuck here...
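For what it's worth, a single re.findall with two alternatives might already cover this case. The sketch below is only a guess at what "one line" could look like, not a solution tested against the real page: it assumes every wanted value is either plain text directly inside a <td>, or the title attribute of a <span>. Here content is retyped from the sample quoted below, and cells is a name made up for illustration.

```python
import re

# Sample row retyped from the quoted message (the last <td> is cut off there too).
content = (
    '<tr align="center" valign="middle" class="CellCss">'
    '<td valign="middle">LA</td>'
    '<td valign="middle">11/10/2008</td>'
    '<td valign="middle">1340/1430</td>'
    '<td valign="middle">PF1/5</td>'
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
    '<td valign="middle">Booked</td>'
    '<td valign="middle">'
)

# Alternative 1 captures a title="..." value; alternative 2 captures the text
# of a <td> that contains no nested tag. Each match fills exactly one group,
# so "a or b" keeps whichever group is non-empty.
cells = [a or b for a, b in
         re.findall(r'title="([^"]*)"|<td[^>]*>([^<]+)</td>', content)]

print(cells)
# ['LA', '11/10/2008', '1340/1430', 'PF1/5',
#  'Understanding the stock market', 'Charisma', 'Booked']
```

Because the second alternative consumes the whole <td title="Charisma" ...>Charisma</td> cell in one match, "Charisma" shows up once (as cell text) rather than twice. Markup that nests tags differently would need a different pattern, or a real parser.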
On 11/7/08, Chris Rebert <[EMAIL PROTECTED]> wrote:
>
> On Thu, Nov 6, 2008 at 11:06 PM, <[EMAIL PROTECTED]> wrote:
> > I always have no idea about how to express "conclude the entire word"
> > with regexp, while using python, I encountered this problem again...
> >
> > for example, if I want to match the "string" in "test a string",
> > re.findall(r"[^a]* (\w+)","test a string") will work, but what if
> > there is not "a" but "an"(test a string)? the [^an] will failed
> > because it will stop at the first character "a".
> >
> > I guess people not always use this kind of way to filter words?
> > Here comes the real problem I encountered:
> > I want to filter the text both in "<td>" block and the "<span>"'s
> > title attribute
>
> Is there any particularly good reason why you're using regexps for
> this rather than, say, an actual (X)HTML parser?
>
> Cheers,
> Chris
> --
> Follow the path of the Iguana...
> http://rebertia.com
>
> > ###################### code #############################
> > import re
> > content='''<tr align="center" valign="middle" class="CellCss"><td
> > valign="middle">LA</td><td valign="middle">11/10/2008</td><td
> > valign="middle">1340/1430</td><td valign="middle">PF1/5</td><td
> > valign="middle"><span title="Understanding the stock market"
> > class="MouseCursor">Understand....</span></td><td title="Charisma"
> > valign="middle">Charisma</td><td valign="middle">Booked</td><td
> > valign="middle">'''
> >
> > re.findall(r'''<td valign="middle">([^<]+)</td><td
> > valign="middle">([^<]+)</td><td valign="middle">([^<]+)</td><td
> > valign="middle">([^<]+)</td><td valign="middle"><span
> > title="([^"]*)"''',content)
> >
> > #################### code end ############################
> > As you saw above,
> > I get the results with "LA,11/10/2008,1340/1430,PF1/5,Understanding
> > the stock market"
> > there are two "<span>" block but I can just get the "title" attribute
> > of the first "<span>" using regexp.
> > for the second, which should be "Charisma" I need to use some kind of
> > [^</td>]* to match "class="MouseCursor">Understand....</span></td>",
> > then I can continue match the second "<span>" block.
> >
> > Maybe I didn't describe this clearly, then feel free to tell me:)
> > thanks for any further reply!
> > --
> > http://mail.python.org/mailman/listinfo/python-list
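On the "exclude an entire word" question in the quoted text: [^an] and [^</td>] are character classes, so they exclude single characters, not the sequences "an" or "</td>". One common way to say "any character, as long as this sequence does not start here" is a negative lookahead. The following is a minimal sketch of that idea, assuming the markup looks exactly like the quoted sample; row, words, pattern and titles are names invented here for illustration.

```python
import re

# To anchor on a whole word, match the word literally instead of negating
# its letters: \b marks a word boundary, so "an" is matched as a unit.
words = re.findall(r"\ban (\w+)", "test an string")
print(words)  # ['string']

# Two cells retyped from the quoted sample.
row = (
    '<td valign="middle"><span title="Understanding the stock market" '
    'class="MouseCursor">Understand....</span></td>'
    '<td title="Charisma" valign="middle">Charisma</td>'
)

pattern = re.compile(
    r'<span\s+title="([^"]*)"'   # first title attribute
    r'(?:(?!</td>).)*</td>'      # any chars, stopping where "</td>" starts, then "</td>"
    r'<td\s+title="([^"]*)"',    # second title attribute
    re.DOTALL,                   # let "." cross line breaks inside the markup
)
titles = pattern.findall(row)
print(titles)  # [('Understanding the stock market', 'Charisma')]
```

A non-greedy .*? followed by the literal </td> would behave the same on this sample; the lookahead form just makes the "do not cross </td>" intent explicit.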