On Sunday, February 24, 2013 7:30:00 PM UTC-5, Dave Angel wrote:
> On 02/24/2013 07:02 PM, Adam W. wrote:
> > I'm trying to write a simple script to scrape
> > http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day
> > in order to send myself an email every day of the 99c movie of the day.
> >
> > However, using a simple command like (in Python 3.0):
> > urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()
> >
> > I don't get all the source I need, it's just the navigation buttons.
> > Now I assume they are using some CSS/javascript witchcraft to load all the
> > useful data later, so my question is how do I make urllib "wait" and grab
> > that data as well?
>
> The CSS and the jpegs, and many other aspects of a web "page" are loaded
> explicitly, by the browser, when parsing the tags of the page you
> downloaded. There is no sooner or later. The website won't send the
> other files until you request them.
>
> For example, that site at the moment has one image (prob. jpeg)
> highlighted,
>
> <img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m"
> alt="Sex and the City: The Movie (Theatrical)">
>
> if you want to look at that jpeg, you need to download the file url
> specified by the src attribute of that img element.
>
> Or perhaps you can just look at the 'alt' attribute, which is mainly
> there for browsers that don't happen to do graphics, for example, the
> ones for the blind.
>
> Naturally, there may be dozens of images on the page, and there's no
> guarantee that the website author is trying to make it easy for you.
> Why not check if there's a defined API for extracting the information
> you want? Check the site, or send a message to the webmaster.
>
> No guarantee that tomorrow, the information won't be buried in some
> javascript fragment. Again, if you want to see that, you might need to
> write a javascript interpreter. It could use any algorithm at all to
> build webpage information, and the encoding could change day by day, or
> hour by hour.
>
> --
> DaveA
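
To illustrate the point about the 'alt' attribute, here is a minimal sketch using only the standard library. It assumes the <img> tag is actually present in the HTML that was downloaded, which may not hold for this page. The names ImgAltCollector and extract_images are made up for illustration; the sample tag is the one quoted above.

```python
from html.parser import HTMLParser


class ImgAltCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("src"), attr_map.get("alt")))


def extract_images(html_text):
    """Return a list of (src, alt) tuples found in html_text (hypothetical helper)."""
    parser = ImgAltCollector()
    parser.feed(html_text)
    parser.close()
    return parser.images


# The tag quoted in the message above, used here as static test input.
sample = (
    '<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" '
    'alt="Sex and the City: The Movie (Theatrical)">'
)

for src, alt in extract_images(sample):
    print(alt, src)
```

If the tag is injected later by javascript, the list simply comes back empty, since the parser only sees what the server sent for the initial request.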
The problem is that the image URL you found is not in the data urllib returns. To be clear, I was aware of what urllib is supposed to do (i.e. not download image data when loading a page). I've used it many times before; I've just never had to jump through hoops to get at the content I needed.

I'll look into finding the XHR requests in Chrome. I didn't know what that after-the-fact loading was called, so now my searching will be more productive.

--
http://mail.python.org/mailman/listinfo/python-list
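
For the XHR route, a rough sketch of what the script might look like once the request has been found in the browser's network panel. The endpoint URL is not known here, so fetch_json takes it as a parameter. That the endpoint returns JSON is an assumption, and fetch_json and decode_json are hypothetical helper names. The urldefrag call shows one likely reason the original request returned only the navigation shell: everything after '#' is a URL fragment, which is handled on the client side and never sent to the server.

```python
import json
import urllib.request
from urllib.parse import urldefrag

PAGE_URL = (
    "http://www.vudu.com/movies/"
    "#tag/99centOfTheDay/99c%20Rental%20of%20the%20day"
)

# Split off the fragment; base_url is the part an HTTP client actually requests.
base_url, fragment = urldefrag(PAGE_URL)
print("requested from server:", base_url)
print("kept by the client:   ", fragment)


def decode_json(raw_bytes, charset="utf-8"):
    """Decode a response body, assuming (not verified) that it is JSON."""
    return json.loads(raw_bytes.decode(charset))


def fetch_json(xhr_url, timeout=10):
    """Fetch xhr_url directly and decode it.

    xhr_url is whatever address the browser's network panel shows for the
    background request; it is deliberately not guessed here.
    """
    request = urllib.request.Request(
        xhr_url, headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return decode_json(response.read(), charset)
```

The background request may also need extra headers, cookies, or POST data copied from what the browser sends, so treat this as a starting point rather than something that will work unmodified.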