Matt wrote:
> Beginner Python user (3.5), trying to scrape this page and get the
> ladder - www.afl.com.au/ladder . It's dynamic content, so I used
> lynx -dump to get a txt file and I'm parsing that.
>
> Here is the code:
>
> # import lynx -dump txt file
> f = open('c:/temp/afl2.txt', 'r').read()
>
> # Put the imported txt file into a list
> afl_list = f.split(' ')
>
> # here are the things we want to search for
> search_list = ['FRE', 'WCE', 'HAW', 'SYD', 'RICH', 'WB', 'ADEL', 'NMFC',
>                'PORT', 'GEEL', 'GWS', 'COLL', 'MELB', 'STK', 'ESS',
>                'GCFC', 'BL', 'CARL']
>
> def build_ladder():
>     for l in search_list:
>         output_num = afl_list.index(l)
>         list_pos = output_num - 1
>         ladder_pos = afl_list[list_pos]
>         print(ladder_pos + ' - ' + l)
>
> build_ladder()
>
> Which outputs this:
>
> 1 - FRE
> 2 - WCE
> 3 - HAW
> 4 - SYD
> 5 - RICH
> 6 - WB
> 7 - ADEL
> 8 - NMFC
> 9 - PORT
> 10 - GEEL
> * - GWS
> 12 - COLL
> 13 - MELB
> 14 - STK
> 15 - ESS
> 16 - GCFC
> 17 - BL
> 18 - CARL
>
> Notice that number 11 is missing because my script picks up "GWS" which
> is located earlier in the page. What is the best way to skip that (and
> get the "GWS" lower down in the txt file), or am I better off
> approaching the code in a different way?
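If you want to stay with the text-dump approach for now, note that list.index() accepts an optional start argument, so each search can resume after the previous match instead of always hitting the first occurrence in the file. A minimal sketch of that idea (the word list below is a made-up stand-in for the lynx dump, with a stray "GWS" near the top of the page):

```python
# Hypothetical stand-in for the lynx -dump word list: a stray "GWS"
# appears early in the page, before the ladder table itself.
afl_list = ['GWS', 'news', '1', 'FRE', '2', 'WCE', '11', 'GWS']
search_list = ['FRE', 'WCE', 'GWS']

def build_ladder(words, teams):
    ladder = []
    cursor = 0                          # position after the last match
    for team in teams:
        # index(value, start) skips everything before `cursor`
        i = words.index(team, cursor)
        ladder.append((words[i - 1], team))
        cursor = i + 1
    return ladder

for pos, team in build_ladder(afl_list, search_list):
    print(pos, '-', team)
```

This only works because the stray "GWS" sits before the table entries, which appear in ladder order; a real HTML parser (below) is the more robust fix.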
If you look at the HTML source you'll see that the desired "GWS" is
inside a table, together with the other abbreviations. To extract
(parts of) that table you should use a tool that understands the
structure of HTML. The most popular library for parsing HTML with
Python is BeautifulSoup, but my example uses lxml:

$ cat ladder.py
import urllib.request
import io
import lxml.html

def first(row, xpath):
    return row.xpath(xpath)[0].strip()

html = urllib.request.urlopen("http://www.afl.com.au/ladder").read()
tree = lxml.html.parse(io.BytesIO(html))
for row in tree.xpath("//tr")[1:]:
    print(
        first(row, ".//td[1]/span/text()"),
        first(row, ".//abbr/text()"))

$ python3 ladder.py
1 FRE
2 WCE
3 HAW
4 SYD
5 RICH
6 WB
7 ADEL
8 NMFC
9 PORT
10 GEEL
11 GWS
12 COLL
13 MELB
14 STK
15 ESS
16 GCFC
17 BL
18 CARL

Someone with better knowledge of XPath could probably avoid some of
the postprocessing I do in Python.

--
https://mail.python.org/mailman/listinfo/python-list
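For completeness, the same "walk the table structure" idea can be done with nothing but the standard library's html.parser, if installing lxml or BeautifulSoup isn't an option. This is a sketch against a made-up fragment shaped like the ladder table (span for position, abbr for club); the real AFL markup will differ:

```python
from html.parser import HTMLParser

# Hypothetical fragment mimicking the ladder table's shape.
SAMPLE = """
<table>
  <tr><th>Pos</th><th>Club</th></tr>
  <tr><td><span>1</span></td><td><abbr>FRE</abbr></td></tr>
  <tr><td><span>2</span></td><td><abbr>WCE</abbr></td></tr>
  <tr><td><span>11</span></td><td><abbr>GWS</abbr></td></tr>
</table>
"""

class LadderParser(HTMLParser):
    """Collect (position, abbreviation) pairs from span/abbr cells."""
    def __init__(self):
        super().__init__()
        self.rows = []        # finished (pos, club) pairs
        self._tag = None      # span/abbr tag we are currently inside
        self._pos = None      # position text seen for the current row

    def handle_starttag(self, tag, attrs):
        if tag in ("span", "abbr"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("span", "abbr"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "span":
            self._pos = text
        elif self._tag == "abbr" and self._pos is not None:
            self.rows.append((self._pos, text))
            self._pos = None

parser = LadderParser()
parser.feed(SAMPLE)
for pos, club in parser.rows:
    print(pos, club)
```

Because the parser only pairs an abbr with a span seen inside the same table flow, a stray "GWS" outside a row wouldn't produce a pair; lxml's XPath version above is still terser.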