Tom Russell <tsrdatat...@gmail.com> writes:

> I am parsing out a web page at
> http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar
> using BeautifulSoup.
>
> My problem is that I can parse into the table where the data I want
> resides but I cannot seem to figure out how to go about grabbing the
> contents of the cell next to my row header I want.
>
> For instance this code below:
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>     for cell in row.findAll("td"):
>         print cell.findAll(text=True)
>
> brings in a list that looks like this:
>
> [u'NYSE']
> [u'Latest close']
> [u'Previous close']
> ...
>
> What I want to do is only be getting the data for NYSE and nothing
> else so I do not know if that's possible or not.

I am quite confident that it is possible (though I do not know the details).

First thing to note: you can use the "break" statement to leave a loop early. As you have a nested loop, you might need a "break" on both levels, with the outer loop's "break" controlled by a variable which indicates "success".

Second thing to note: the "BeautifulSoup" documentation should tell you about the return values of its methods. Note that "findAll" returns BeautifulSoup's own "Tag" objects rather than "lxml" elements (BeautifulSoup 4 can use "lxml" as its parser, but the objects you get back are still BeautifulSoup's), so the "BeautifulSoup" documentation, not the "lxml" one, is the place to learn how to inspect the html structure further, for example how to get from one cell to its neighbour.

-- 
http://mail.python.org/mailman/listinfo/python-list
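
The "break on both levels" idea from the first point might look roughly like the sketch below. It is only an illustration: the function name "value_next_to" is hypothetical, and "rows" is a made-up stand-in for the text of each cell, row by row, as you might collect it from the "td" elements. The numbers are placeholders, not real market data, and the real table may be laid out differently.

```python
def value_next_to(rows, header):
    """Return the text of the cell right after `header`, or None if absent."""
    found = False              # the "success" variable
    value = None
    for row in rows:
        for i, cell in enumerate(row):
            if cell.strip() == header and i + 1 < len(row):
                value = row[i + 1]
                found = True
                break          # leave the inner (cell) loop
        if found:
            break              # leave the outer (row) loop
    return value


# Hypothetical sample: one list of cell texts per table row.
rows = [
    ["NYSE", "Latest close", "Previous close"],
    ["Issues traded", "3,142", "3,160"],
    ["Nasdaq", "Latest close", "Previous close"],
    ["Issues traded", "2,611", "2,598"],
]

print(value_next_to(rows, "Issues traded"))   # prints 3,142
```

Because both loops stop at the first match, the second "Issues traded" row (under "Nasdaq" in this sample) is never reached. That only gives the NYSE figure if the NYSE block really does come first in the table, which is an assumption here.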