On Sat, Dec 7, 2019 at 1:21 PM DL Neil via Python-list
<python-list@python.org> wrote:
>
> On 7/12/19 1:51 PM, Chris Angelico wrote:
> > On Sat, Dec 7, 2019 at 11:46 AM Michael Torrie <torr...@gmail.com> wrote:
> >>
> >> On 12/6/19 5:31 PM, DL Neil via Python-list wrote:
> >>> If you read the HTML data that the REPL has happily splattered all over
> >>> your terminal's screen (scroll back) (NB "soup" is easier to read than
> >>> is "content"!) you will observe that what you saw in your web-browser is
> >>> not what Amazon served in response to the Python "requests.get()"!
> >>
> >> Sadly it's likely that Amazon's page is largely built from javascript.
> >> So scraping static html is probably not going to get you where you want
> >> to go. There are heavier tools, such as Selenium that uses a real
> >> browser to grab a page, and the result of that you can parse and search
> >> perhaps.
> >
> > Or look for an API instead.
>
>
> Both +1
> However, Selenium is possibly less-manageable for a 'beginner'.
> (NB my poorly-based assumption of OP)
>
> Amazon's HTML-response actually says this/these, but I left it open as a
> (learning) exercise for the OP. They likely prefer the API approach,
> because it can be measured...
>

Yes, and because it's way WAY easier to guarantee API stability than
Selenium-based page parseability. But even when there's no *actual*
API, you can sometimes delve into the page and find the actual useful
content, perhaps as a big blob of JSON inside a <script> tag. There'll
be no guarantees, of course (but there aren't any with parsing the HTML
either), but it'll be way easier to parse.

ChrisA
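
To illustrate the "blob of JSON inside a <script> tag" idea, here's a minimal sketch using only the standard library. The sample HTML, the "product-data" id, and the names ScriptCollector and extract_json_blobs are all made up for the example; a real page will lay out its data differently (often as a JavaScript assignment that needs trimming before it parses as JSON), so treat this as a starting point rather than something that works on any particular site.

```python
import json
from html.parser import HTMLParser

# Hypothetical page content, standing in for the text of a real response.
SAMPLE_HTML = """
<html><body>
<script>var tracking = 1;</script>
<script type="application/json" id="product-data">
{"title": "Example Widget", "price": {"amount": 19.99, "currency": "USD"}}
</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self._chunks = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script bodies arrive here as raw text, possibly in several pieces.
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks))
            self._in_script = False


def extract_json_blobs(html):
    """Return every <script> body that parses as a JSON object or array."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    blobs = []
    for text in parser.scripts:
        try:
            data = json.loads(text)
        except ValueError:
            continue  # ordinary JavaScript, not JSON; skip it
        if isinstance(data, (dict, list)):
            blobs.append(data)
    return blobs


if __name__ == "__main__":
    for blob in extract_json_blobs(SAMPLE_HTML):
        print(blob["title"], blob["price"]["amount"])
```

Against a live site you'd pass in the text of the response instead of SAMPLE_HTML, and the same "no guarantees" caveat applies: the site can rearrange its script tags whenever it likes.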