Re: using urllib on a more complex site

Adam W. Sun, 24 Feb 2013 17:40:33 -0800

On Sunday, February 24, 2013 7:27:54 PM UTC-5, Chris Rebert wrote:
> On Sunday, February 24, 2013, Adam W. wrote:
> I'm trying to write a simple script to scrape
> http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> in order to send myself an email every day of the 99c movie of the day.
>
> However, using a simple command like (in Python 3.0):
>
> urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
>
> I don't get all the source I need; it's just the navigation buttons. Now
> I assume they are using some CSS/JavaScript witchcraft to load all the useful
> data later, so my question is: how do I make urllib "wait" and grab that data
> as well?
>
> urllib isn't a web browser. It just requests the single (in this case, HTML) 
> file from the given URL. It does not parse the HTML (indeed, it doesn't care 
> what kind of file you're dealing with); therefore, it obviously does not 
> retrieve the other resources linked within the document (CSS, JS, images, 
> etc.) nor does it run any JavaScript. So, there's nothing to "wait" for; 
> urllib is already doing everything it was designed to do.
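
One detail worth spelling out with the URL above: everything after the "#" is a
fragment, which an HTTP client keeps to itself and never sends to the server, so
the request is really for http://www.vudu.com/movies/ alone. A small sketch using
urllib.parse.urldefrag (the helper name split_fragment is just for illustration):

```python
from urllib.parse import urldefrag

PAGE_URL = ("http://www.vudu.com/movies/"
            "#tag/99centOfTheDay/99c%20Rental%20of%20the%20day")


def split_fragment(url):
    """Return (url_without_fragment, fragment) for the given URL.

    Illustrative helper, not part of urllib itself.
    """
    result = urldefrag(url)
    return result.url, result.fragment


requested, fragment = split_fragment(PAGE_URL)
print(requested)  # the part a server would actually be asked for
print(fragment)   # the part only the browser-side JavaScript ever sees
```

So the "tag/99centOfTheDay/..." part is only ever interpreted by the page's own
scripts, which is why the raw HTML contains just the navigation shell.
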
>
> Your best bet is to open the page in a web browser yourself and use the 
> developer tools/inspectors to watch what XHR requests the page's scripts are 
> making, find the one(s) that have the data you care about, and then make 
> those requests instead via urllib (or the `requests` 3rd-party lib, or 
> whatever). If the URL(s) vary, reverse-engineering the scheme used to 
> generate them will also be required.
>
> Alternatively, you could use something like Selenium, which lets you drive
> an actual full web browser (e.g. Firefox) from Python.
>
> Cheers,
> Chris
> --
> http://rebertia.com


Huzzah! Found it: 
http://apicache.vudu.com/api2/claimedAppId/myvudu/format/application*2Fjson/callback/DirectorSequentialCallback/_type/contentSearch/count/30/dimensionality/any/followup/ratingsSummaries/followup/totalCount/offset/0/tag/99centOfTheDay/type/program/type/season/type/episode/type/bundle

Thanks for the tip about XHRs.
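
A minimal sketch of the next step, assuming the endpoint answers with JSON
wrapped in a JSONP-style callback (the callback/DirectorSequentialCallback part
of the URL hints at that, but the actual response format is not confirmed here).
The helper names strip_jsonp and fetch_listing are made up for illustration:

```python
import json
import re
import urllib.request


def strip_jsonp(text):
    """Decode JSON, removing a wrapper like  someCallback(...);  if present.

    ASSUMPTION: the payload is either plain JSON or JSON wrapped in a single
    function call. Any other wrapping would need different handling.
    """
    match = re.match(r"^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    body = match.group(1) if match else text
    return json.loads(body)


def fetch_listing(url):
    """Fetch the given URL and decode its (possibly JSONP-wrapped) JSON body."""
    with urllib.request.urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return strip_jsonp(response.read().decode(charset))
```

Calling fetch_listing() with the apicache URL above should then hand back
ordinary Python dicts and lists; which keys hold the movie title is something
to work out by inspecting the result.
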
-- 
http://mail.python.org/mailman/listinfo/python-list
