On Wed, Dec 8, 2021 at 4:51 AM Julius Hamilton <juliushamilton...@gmail.com> wrote:
>
> Hey,
>
> I am currently working on a simple program which scrapes text from
> webpages via a URL, then segments it (with Spacy).
>
> I’m trying to refine my program to use just the right tools for the job,
> for each of the steps.
>
> Requests.get works great, but I’ve seen people use
> urllib.request.urlopen() in some examples. It appealed to me because it
> seemed lower level than requests.get, so it just makes the program feel
> leaner and purer and more direct.
>
> However, requests.get works fine on this url:
>
> https://juno.sh/direct-connection-to-jupyter-server/
>
> But urllib returns a “403 forbidden”.
>
> Could anyone please comment on what the fundamental differences are
> between urllib vs. requests, why this would happen, and if urllib has
> any option to prevent this and get the page source?
>
*Fundamental* differences? Not many. The requests module is designed to be
easy to use, whereas urllib is designed to be basic and simple. That's not
really a fundamental difference, but it is perhaps indicative.

I'd recommend doing the query with requests and seeing exactly what headers
it sends. Most likely there'll be something the server is looking for that
you need to add explicitly when using urllib (a User-Agent header is the
usual culprit). Requests does its logging through Python's logging module,
so it should be a simple matter of setting the log level to DEBUG and
sending the request; see the two sketches below.

TBH though, I'd just recommend using requests, unless you specifically need
to avoid the dependency :)

ChrisA
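Something along these lines should dump the outgoing headers. It's a minimal
sketch against your URL; note that the raw header echo from http.client goes
to stdout via print rather than through the logging module:

```python
import logging
import http.client

import requests

# Requests uses urllib3 under the hood, and urllib3 logs via Python's
# logging module; DEBUG level shows the connections being made.
logging.basicConfig(level=logging.DEBUG)

# http.client can additionally echo the raw request/response headers.
# Its debug output is printed to stdout, not routed through logging.
http.client.HTTPConnection.debuglevel = 1

resp = requests.get("https://juno.sh/direct-connection-to-jupyter-server/")
print(resp.status_code)
```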
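And if it does turn out to be the User-Agent, urllib lets you set headers
explicitly by building a Request object. A minimal sketch; the exact UA
string here is just an example, any browser-ish value may do:

```python
import urllib.request

url = "https://juno.sh/direct-connection-to-jupyter-server/"

# urllib sends "Python-urllib/3.x" as its User-Agent by default, which
# some servers reject with a 403. Supplying a different one often helps.
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")

print(html[:200])
```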