Jerry Rocteur wrote:

> Hi,
>
> I'm trying to parse https://matchup.io/players/rocteur/friends
>
> The body source I'm interested in contains blocks exactly like this
>
> <tr class='friend'>
> <td class='text--left'>
> <a href="/players/mizucci0"><img alt="mizucci0" class="media__avatar"
> src="https://matchup-io.s3.amazonaws.com/uploads/player/avatar/7651/7651_profile_150_square.jpeg"
> />
> <div class='friend__info'>
> <span>mizucci0</span>
> <span>Mizuho</span>
> </div>
> </a></td>
> <td class='delta-alt'>
> 29,646
> <br>
> steps
> </td>
> <td class='delta-alt'>
> 35,315
> <br>
> steps
> </td>
> <td class='delta-alt'>
> 818.7
> <br>
> Miles
> </td>
> </tr>
>
> I wanted to do it in Python as I'm learning and I looked at the different
> modules but it isn't easy for me to work out the best way to do this
> as most tutorials I see use complicated classes and I just want to
> parse this one paragraph at a time (as I would do in Perl) and print
>
> 1 mizuho 26648 35315
> 2 xxxxxx 99999 99999
> 3 xxxxxx 99999 99999
>
> etc. (in the above case I'm ignoring 818.7 and Miles).
>
> The best way I found so far is this:
>
> from lxml import html
> import requests
> page = requests.get("https://matchup.io/players/rocteur/friends/week/")
> tree = html.fromstring(page.text)
> a = tree.xpath('//span/text()')
> b = tree.xpath('//td/text()')
>
> And then manipulating indices
>
> e.g.
> print "%s %s %s %s" % (a[usern], a[users], b[tots], b[weekb])
> tots += 4
> weekb += 4
> usern += 2
> users += 2
>
> But it isn't very scientific ;-)

In my experience scraping data from a web page never is. The trick is not to
waste too much time on your script once you have it working. The next
overhaul of the scraped page is already on the way, and yes, it will rely
heavily on JavaScript ;)

> Which module would you use and how would you suggest is the best way to do
> it ?

I think lxml is a good choice. Is there something with an API you prefer in
Perl?

> Thanks very much in advance, I haven't done a lot of HTML parsing.. I
> would much prefer using WebServices and an API but unfortunately they
> don't have it.

PS: Here's my take:

import requests
import lxml.html

def get_html():
    # Fetch the raw HTML of the weekly friends page.
    return requests.get("https://matchup.io/players/rocteur/friends/week/").text

def fix(value):
    # Take the cell text before the <br>, trim whitespace and drop the
    # thousands separator: "\n29,646\n" becomes "29646".
    return value.text.strip().replace(",", "")

tree = lxml.html.fromstring(get_html())
# Each friend is one <tr class="friend">; look things up relative to that row
# instead of juggling indices into page-wide lists.
for friend in tree.xpath('//tr[@class="friend"]'):
    values = friend.xpath('.//td[@class="delta-alt"]')
    print(
        friend.xpath('.//div/span[2]/text()')[0],  # second span: display name
        fix(values[0]),
        fix(values[1])
    )

--
https://mail.python.org/mailman/listinfo/python-list
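
PPS: If you also want the running number from your example output, one way to sketch it is to collect the rows first and then number them with enumerate(). What follows is only a sketch under a few assumptions: the names SAMPLE and parse_friends are mine, and SAMPLE is the block from your mail (avatar image trimmed, a table wrapper added) so the snippet runs without touching the site. I don't know what the two step columns stand for, so they are simply called first and second here.

```python
import lxml.html

# Markup copied from the question, with the avatar <img> left out and a
# <table> wrapper added so the fragment parses on its own (both assumptions).
SAMPLE = """
<table>
<tr class='friend'>
<td class='text--left'>
<a href="/players/mizucci0">
<div class='friend__info'>
<span>mizucci0</span>
<span>Mizuho</span>
</div>
</a></td>
<td class='delta-alt'>
29,646
<br>
steps
</td>
<td class='delta-alt'>
35,315
<br>
steps
</td>
<td class='delta-alt'>
818.7
<br>
Miles
</td>
</tr>
</table>
"""


def parse_friends(markup):
    """Return one (name, first, second) tuple per <tr class="friend"> row."""
    tree = lxml.html.fromstring(markup)
    rows = []
    for friend in tree.xpath('//tr[@class="friend"]'):
        # Second <span> inside the info <div> holds the display name.
        name = friend.xpath('.//div/span[2]/text()')[0].strip()
        cells = friend.xpath('.//td[@class="delta-alt"]')
        # Only the first two cells are step counts; the third (miles) is ignored.
        first, second = (
            int(cell.text.strip().replace(",", "")) for cell in cells[:2]
        )
        rows.append((name, first, second))
    return rows


# Number the rows starting at 1, as in the desired output.
for position, (name, first, second) in enumerate(parse_friends(SAMPLE), start=1):
    print(position, name, first, second)
```

Converting the counts to int is optional, but it makes sorting or summing them later easier than working with the comma-free strings.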