Cloudflare, for whatever reason, appears to be rejecting the `User-Agent` header that urllib is providing: `Python-urllib/3.9`. Using a different `User-Agent` seems to get around the issue:
import urllib.request

req = urllib.request.Request(
    url="https://juno.sh/direct-connection-to-jupyter-server/",
    method="GET",
    headers={"User-Agent": "Workaround/1.0"},
)
res = urllib.request.urlopen(req)

Paul

On Tue, 2021-12-07 at 12:35 +0100, Julius Hamilton wrote:
> Hey,
>
> I am currently working on a simple program which scrapes text from
> webpages via a URL, then segments it (with Spacy).
>
> I’m trying to refine my program to use just the right tools for the
> job, for each of the steps.
>
> Requests.get works great, but I’ve seen people use
> urllib.request.urlopen() in some examples. It appealed to me because
> it seemed lower level than requests.get, so it just makes the program
> feel leaner and purer and more direct.
>
> However, requests.get works fine on this url:
>
> https://juno.sh/direct-connection-to-jupyter-server/
>
> But urllib returns a “403 forbidden”.
>
> Could anyone please comment on what the fundamental differences are
> between urllib vs. requests, why this would happen, and if urllib has
> any option to prevent this and get the page source?
>
> Thanks,
> Julius
--
https://mail.python.org/mailman/listinfo/python-list
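A follow-up thought: if the scraper makes many requests, setting the header on every `Request` gets repetitive. One way around that is to install an opener whose default headers replace urllib's identifier globally. A minimal sketch (the "Workaround/1.0" value is arbitrary, as in the example above):

```python
import urllib.request

# Build an opener whose default User-Agent replaces urllib's
# built-in "Python-urllib/3.x" identifier, then install it so
# plain urllib.request.urlopen(url) calls pick it up too.
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "Workaround/1.0")]
urllib.request.install_opener(opener)

# The per-request form still works as well; note that urllib
# normalizes header names via str.capitalize() internally.
req = urllib.request.Request(
    url="https://juno.sh/direct-connection-to-jupyter-server/",
    headers={"User-Agent": "Workaround/1.0"},
)
```

Whether the global opener or the per-request header is cleaner depends on the program; for a single fetch the explicit `Request` is probably simpler.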