On Sunday, February 24, 2013 7:27:54 PM UTC-5, Chris Rebert wrote:
> On Sunday, February 24, 2013, Adam W. wrote:
> > I'm trying to write a simple script to scrape
> > http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> > in order to send myself an email every day of the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> >
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get all the source I need, it's just the navigation buttons. Now
> > I assume they are using some CSS/javascript witchcraft to load all the useful
> > data later, so my question is how do I make urllib "wait" and grab that data
> > as well?
>
> urllib isn't a web browser. It just requests the single (in this case, HTML)
> file from the given URL. It does not parse the HTML (indeed, it doesn't care
> what kind of file you're dealing with); therefore, it obviously does not
> retrieve the other resources linked within the document (CSS, JS, images,
> etc.) nor does it run any JavaScript. So, there's nothing to "wait" for;
> urllib is already doing everything it was designed to do.
>
> Your best bet is to open the page in a web browser yourself and use the
> developer tools/inspectors to watch what XHR requests the page's scripts are
> making, find the one(s) that have the data you care about, and then make
> those requests instead via urllib (or the `requests` 3rd-party lib, or
> whatever). If the URL(s) vary, reverse-engineering the scheme used to
> generate them will also be required.
>
> Alternatively, you could use something like Selenium, which lets you drive
> an actual full web browser (e.g. Firefox) from Python.
>
> Cheers,
> Chris
> --
> http://rebertia.com

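For anyone following along, here is a small sketch of why the original request could only ever return the navigation shell. Everything after the `#` in that URL is a fragment, which is handled client-side (here, presumably by the page's JavaScript) and is not part of what gets requested from the server. The variable names are just for illustration.

```python
from urllib.parse import urlsplit

# The page URL from the original question.
PAGE_URL = ("http://www.vudu.com/movies/"
            "#tag/99centOfTheDay/99c%20Rental%20of%20the%20day")

parts = urlsplit(PAGE_URL)

# Only the path (plus any query string) is requested from the server.
requested_path = parts.path

# The fragment stays on the client; the page's scripts read it and then
# load the real data with separate requests.
client_side_fragment = parts.fragment

print("requested from server:", requested_path)
print("kept on the client:   ", client_side_fragment)
```

So the "tag" part that identifies the 99c listing never reaches the server in that request, which fits with only the generic page coming back.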
Huzzah! Found it:

http://apicache.vudu.com/api2/claimedAppId/myvudu/format/application*2Fjson/callback/DirectorSequentialCallback/_type/contentSearch/count/30/dimensionality/any/followup/ratingsSummaries/followup/totalCount/offset/0/tag/99centOfTheDay/type/program/type/season/type/episode/type/bundle

Thanks for the tip about XHRs.
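In case it helps someone else, a rough sketch of fetching an endpoint like this with urllib and parsing the result. I am assuming, from the `callback/DirectorSequentialCallback` part of the URL, that the response may be JSON wrapped in a callback call (JSONP style); the actual response format is not confirmed here, so treat `strip_jsonp` and `fetch_payload` as hypothetical helpers that may need adjusting.

```python
import json
import urllib.request


def strip_jsonp(text):
    """Parse plain JSON, or JSON wrapped as callbackName(...).

    Assumption: the payload is either bare JSON or a single callback
    wrapper. Any other framing will raise ValueError.
    """
    text = text.strip()
    try:
        return json.loads(text)
    except ValueError:
        pass  # not bare JSON; try the callback(...) form below

    start = text.find("(")
    end = text.rfind(")")
    if start == -1 or end <= start:
        raise ValueError("response is neither JSON nor callback(...)")
    return json.loads(text[start + 1:end])


def fetch_payload(url, timeout=10):
    """Fetch url and return the decoded payload (performs a network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return strip_jsonp(resp.read().decode(charset))
```

Calling `fetch_payload(...)` with the apicache URL above should then give a regular Python object to pick the title out of, assuming the response really is shaped that way. The key names inside the payload are not known from this thread, so inspect it (e.g. with `print`) before relying on any of them.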