> -----Original Message-----
> From: Python-list [mailto:python-list-
> bounces+matt=centralkaos....@python.org] On Behalf Of Peter Otten
> Sent: Tuesday, 19 January 2016 9:30 PM
> To: python-list@python.org
> Subject: Re: web scraping help / better way to do it ?
>
> Matt wrote:
>
> > Beginner python user (3.5) here, trying to scrape this page and get the
> > ladder - www.afl.com.au/ladder . It's dynamic content, so I used
> > lynx -dump to get a txt file and am parsing that.
> >
> > Here is the code:
> >
> > # read in the lynx -dump txt file
> > f = open('c:/temp/afl2.txt', 'r').read()
> >
> > # split the imported txt file into a list
> > afl_list = f.split(' ')
> >
> > # here are the things we want to search for
> > search_list = ['FRE', 'WCE', 'HAW', 'SYD', 'RICH', 'WB', 'ADEL',
> >                'NMFC', 'PORT', 'GEEL', 'GWS', 'COLL', 'MELB', 'STK',
> >                'ESS', 'GCFC', 'BL', 'CARL']
> >
> > def build_ladder():
> >     for l in search_list:
> >         output_num = afl_list.index(l)
> >         list_pos = output_num - 1
> >         ladder_pos = afl_list[list_pos]
> >         print(ladder_pos + ' ' + '-' + ' ' + l)
> >
> > build_ladder()
> >
> > Which outputs this:
> >
> > 1 - FRE
> > 2 - WCE
> > 3 - HAW
> > 4 - SYD
> > 5 - RICH
> > 6 - WB
> > 7 - ADEL
> > 8 - NMFC
> > 9 - PORT
> > 10 - GEEL
> > * - GWS
> > 12 - COLL
> > 13 - MELB
> > 14 - STK
> > 15 - ESS
> > 16 - GCFC
> > 17 - BL
> > 18 - CARL
> >
> > Notice that number 11 is missing because my script picks up a "GWS"
> > which is located earlier in the page. What is the best way to skip
> > that (and get the "GWS" lower down in the txt file), or am I better
> > off approaching the code in a different way?
>
> If you look at the html source you'll see that the desired "GWS" is inside
> a table, together with the other abbreviations. To extract (parts of) that
> table you should use a tool that understands the structure of html.
>
> The most popular library for parsing html with Python is BeautifulSoup,
> but my example uses lxml:
>
> $ cat ladder.py
> import urllib.request
> import io
> import lxml.html
>
> def first(row, xpath):
>     return row.xpath(xpath)[0].strip()
>
> html = urllib.request.urlopen("http://www.afl.com.au/ladder").read()
> tree = lxml.html.parse(io.BytesIO(html))
>
> for row in tree.xpath("//tr")[1:]:
>     print(
>         first(row, ".//td[1]/span/text()"),
>         first(row, ".//abbr/text()"))
>
> $ python3 ladder.py
> 1 FRE
> 2 WCE
> 3 HAW
> 4 SYD
> 5 RICH
> 6 WB
> 7 ADEL
> 8 NMFC
> 9 PORT
> 10 GEEL
> 11 GWS
> 12 COLL
> 13 MELB
> 14 STK
> 15 ESS
> 16 GCFC
> 17 BL
> 18 CARL
>
> Someone with better knowledge of XPath could probably avoid some of the
> postprocessing I do in Python.
>
> --

Thanks Peter, you opened my eyes to a half dozen things here, just what I needed.
Much appreciated.

Cheers,
Matt

--
https://mail.python.org/mailman/listinfo/python-list
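Peter's lxml version depends on a third-party package; the same structural idea can be sketched with only Python's standard library `html.parser`. The snippet below is a hypothetical stand-in: the markup in `LADDER_HTML` only approximates the real page's table, but it shows why walking the parsed structure (rather than splitting plain text on spaces) pairs each abbreviation with the position that sits next to it in the same row:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the ladder table markup; the real page's
# attributes and nesting may differ.
LADDER_HTML = """
<table>
  <tr><th>Pos</th><th>Club</th></tr>
  <tr><td><span>1</span></td><td><abbr>FRE</abbr></td></tr>
  <tr><td><span>2</span></td><td><abbr>WCE</abbr></td></tr>
  <tr><td><span>3</span></td><td><abbr>HAW</abbr></td></tr>
</table>
"""

class LadderParser(HTMLParser):
    """Collect (position, abbreviation) pairs from table rows."""

    def __init__(self):
        super().__init__()
        self.rows = []    # completed (position, abbreviation) pairs
        self._tag = None  # the span/abbr tag we are currently inside
        self._pos = None  # position text seen so far in the current row

    def handle_starttag(self, tag, attrs):
        if tag in ("span", "abbr"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("span", "abbr"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "span":
            self._pos = text
        elif self._tag == "abbr" and self._pos is not None:
            # abbr following a span in the same row: one ladder entry
            self.rows.append((self._pos, text))
            self._pos = None

parser = LadderParser()
parser.feed(LADDER_HTML)
for pos, club in parser.rows:
    print(pos, club)
```

Because the pairing happens per row of the parsed tree, a stray "GWS" elsewhere in the page would never be matched against a ladder position, which is exactly the failure mode of the `afl_list.index(l)` approach.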