Rate limiting a web crawler
Hi,

I want to build a simple web crawler. I know how I am going to do it
but I have one problem.

Obviously I don't want to negatively impact any of the websites that I
am crawling, so I want to implement some form of rate limiting of HTTP
requests to specific domain names.

What I'd like is some form of timer which calls a piece of code, say
every 5 seconds, and that code is what goes off and crawls the website.

I'm just not sure of the best way to call code based on a timer.

Could anyone offer some advice on the best way to do this? It will be
running on Linux, using the python-daemon library to run it as a
service, and will be using at least Python 3.6.

Thanks for any help.
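For illustration, a minimal per-domain rate limiter along these lines
might look like the sketch below. The 5-second delay, the polite_fetch
helper, and the use of urllib are assumptions for the example, not
part of the poster's design.

import time
from urllib.parse import urlparse
from urllib.request import urlopen

DELAY_SECONDS = 5.0   # assumed minimum gap between hits to the same domain
_last_hit = {}        # domain -> time.monotonic() of the most recent request

def polite_fetch(url):
    """Fetch a URL, sleeping first if its domain was hit too recently."""
    domain = urlparse(url).netloc
    wait = DELAY_SECONDS - (time.monotonic() - _last_hit.get(domain, float("-inf")))
    if wait > 0:
        time.sleep(wait)
    _last_hit[domain] = time.monotonic()
    with urlopen(url) as resp:
        return resp.read()

A daemon loop could then simply call polite_fetch(url) for each URL in
its queue and let the per-domain bookkeeping enforce the delay.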
Re: Rate limiting a web crawler
On 12/26/18 10:35 AM, Simon Connah wrote:
> Hi,
>
> I want to build a simple web crawler. I know how I am going to do it
> but I have one problem.
>
> Obviously I don't want to negatively impact any of the websites that I
> am crawling, so I want to implement some form of rate limiting of HTTP
> requests to specific domain names.
>
> What I'd like is some form of timer which calls a piece of code, say
> every 5 seconds, and that code is what goes off and crawls the website.
>
> I'm just not sure of the best way to call code based on a timer.
>
> Could anyone offer some advice on the best way to do this? It will be
> running on Linux, using the python-daemon library to run it as a
> service, and will be using at least Python 3.6.
>
> Thanks for any help.

One big piece of information that would help in replies would be an
indication of scale. Is your application crawling just a few sites, so
that you need to pause between accesses to keep the hit rate down, or
are you crawling a number of sites, so that if you are going to delay
crawling a page from one site, you can go off and crawl another in the
meantime?

--
Richard Damon
Re: Rate limiting a web crawler
On 12/26/2018 10:35 AM, Simon Connah wrote:
> Hi,
>
> I want to build a simple web crawler. I know how I am going to do it
> but I have one problem.
>
> Obviously I don't want to negatively impact any of the websites that I
> am crawling, so I want to implement some form of rate limiting of HTTP
> requests to specific domain names.
>
> What I'd like is some form of timer which calls a piece of code, say
> every 5 seconds, and that code is what goes off and crawls the website.
>
> I'm just not sure of the best way to call code based on a timer.
>
> Could anyone offer some advice on the best way to do this? It will be
> running on Linux, using the python-daemon library to run it as a
> service, and will be using at least Python 3.6.

You can use asyncio to make repeated non-blocking requests to a web
site at timed intervals and to work with multiple websites at once.

You can do the same with tkinter, except that requests would block
until a response arrives unless you implemented your own polling.

--
Terry Jan Reedy
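A rough sketch of the asyncio approach described here: one task per
site, each sleeping between requests so that no single domain is hit
too often. The 5-second interval, the example URLs, and running the
blocking fetch in a thread pool are assumptions for illustration.

import asyncio
import urllib.request

SITES = ["https://example.com/", "https://example.org/"]  # placeholder URLs
INTERVAL = 5.0  # assumed per-domain delay in seconds

def fetch(url):
    # Blocking fetch, pushed onto a worker thread so the event loop stays free.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

async def crawl_site(url, loop):
    while True:
        body = await loop.run_in_executor(None, fetch, url)
        print(url, len(body), "bytes")
        await asyncio.sleep(INTERVAL)  # rate limit for this one domain

def main():
    # asyncio.run() needs 3.7+, so use the 3.6-compatible spelling here.
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(crawl_site(u, loop) for u in SITES)))

if __name__ == "__main__":
    main()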
Re: Rate limiting a web crawler
On 26/12/2018 18:30, Richard Damon wrote:
> On 12/26/18 10:35 AM, Simon Connah wrote:
>> Hi,
>>
>> I want to build a simple web crawler. I know how I am going to do it
>> but I have one problem.
>>
>> Obviously I don't want to negatively impact any of the websites that I
>> am crawling, so I want to implement some form of rate limiting of HTTP
>> requests to specific domain names.
>>
>> What I'd like is some form of timer which calls a piece of code, say
>> every 5 seconds, and that code is what goes off and crawls the website.
>>
>> I'm just not sure of the best way to call code based on a timer.
>>
>> Could anyone offer some advice on the best way to do this? It will be
>> running on Linux, using the python-daemon library to run it as a
>> service, and will be using at least Python 3.6.
>>
>> Thanks for any help.
>
> One big piece of information that would help in replies would be an
> indication of scale. Is your application crawling just a few sites, so
> that you need to pause between accesses to keep the hit rate down, or
> are you crawling a number of sites, so that if you are going to delay
> crawling a page from one site, you can go off and crawl another in the
> meantime?

Sorry, I should have stated that. This is for a minimum viable product,
so crawling say two or three domain names would be enough to start
with, but I'd want to grow in the future.

I'm building this on AWS, and my idea was to have each web crawler
instance query a database (DynamoDB), get say 10 URLs, and recrawl
them if they hadn't been crawled in the previous 12 to 24 hours. If
they have been crawled in the last 12 to 24 hours, skip that URL. Once
a URL has been crawled I would then save the crawl date and time in
the database.

Doing it that way I could skip the whole timing thing on the daemon end
and just use database queries to control whether a URL is crawled or
not. Of course, that would mean that one web crawler would have to
"lock" a domain name so that multiple instances do not query the same
domain name in parallel, which would be bad.
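Since the message mentions DynamoDB, here is one way the per-domain
"lock" could be sketched with a conditional write. The boto3 calls are
standard, but the table name, attribute names, and lease length are
hypothetical and not part of the poster's design.

import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
locks = dynamodb.Table("crawler_domain_locks")  # hypothetical table, PK "domain"

LEASE_SECONDS = 300  # assumed maximum time one crawler may hold a domain

def acquire_domain(domain):
    """Try to claim a domain; only one crawler at a time should succeed."""
    now = int(time.time())
    try:
        locks.put_item(
            Item={"domain": domain, "expires_at": now + LEASE_SECONDS},
            # Succeed only if no lock exists or the previous lease has expired.
            ConditionExpression="attribute_not_exists(#d) OR expires_at < :now",
            ExpressionAttributeNames={"#d": "domain"},
            ExpressionAttributeValues={":now": now},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another crawler currently holds this domain
        raise

def release_domain(domain):
    locks.delete_item(Key={"domain": domain})

A crawler instance would call acquire_domain() before fetching any URLs
for that domain and release_domain() when it is done, so two instances
never hit the same site in parallel.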
Re: Rate limiting a web crawler
On 26/12/2018 19:04, Terry Reedy wrote:
> On 12/26/2018 10:35 AM, Simon Connah wrote:
>> Hi,
>>
>> I want to build a simple web crawler. I know how I am going to do it
>> but I have one problem.
>>
>> Obviously I don't want to negatively impact any of the websites that I
>> am crawling, so I want to implement some form of rate limiting of HTTP
>> requests to specific domain names.
>>
>> What I'd like is some form of timer which calls a piece of code, say
>> every 5 seconds, and that code is what goes off and crawls the website.
>>
>> I'm just not sure of the best way to call code based on a timer.
>>
>> Could anyone offer some advice on the best way to do this? It will be
>> running on Linux, using the python-daemon library to run it as a
>> service, and will be using at least Python 3.6.
>
> You can use asyncio to make repeated non-blocking requests to a web
> site at timed intervals and to work with multiple websites at once.
>
> You can do the same with tkinter, except that requests would block
> until a response arrives unless you implemented your own polling.

Thank you. I'll look into asyncio.
Pycharm issue with import ssl
While trying to run Python code in PyCharm 2018.3.2 I am getting the
error below. Can someone help?

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\pydev\pydevconsole.py", line 5, in <module>
    from _pydev_comm.rpc import make_rpc_client, start_rpc_server, start_rpc_server_and_make_client
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\pydev\_pydev_comm\rpc.py", line 4, in <module>
    from _pydev_comm.server import TSingleThreadedServer
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\pydev\_pydev_comm\server.py", line 4, in <module>
    from _shaded_thriftpy.server import TServer
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\third_party\thriftpy\_shaded_thriftpy\server.py", line 9, in <module>
    from _shaded_thriftpy.transport import (
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\third_party\thriftpy\_shaded_thriftpy\transport\__init__.py", line 57, in <module>
    from .sslsocket import TSSLSocket, TSSLServerSocket  # noqa
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\third_party\thriftpy\_shaded_thriftpy\transport\sslsocket.py", line 7, in <module>
    import ssl
  File "C:\Users\grajendran\Anaconda3\lib\ssl.py", line 98, in <module>
    import _ssl             # if we can't import it, let the error propagate
ImportError: DLL load failed: The specified module could not be found.

Process finished with exit code 1
Ask for help about class variable scope (Re: Why doesn't a dictionary work in classes?)
I saw the code below at stackoverflow. I have a little idea about the
scope of a class, and about list comprehensions and generator
expressions, but still can't figure out why Z4 works and Z5 does not.
Can someone explain it? (in a not-too-complicated way:-)

class Foo():
    XS = [15, 15, 15, 15]
    Z4 = sum(val for val in XS)
    try:
        Z5 = sum(XS[i] for i in range(len(XS)))
    except NameError:
        Z5 = None

print(Foo.Z4, Foo.Z5)
>>> 60 None

--Jach
Re: Ask for help about class variable scope (Re: Why doesn't a dictionary work in classes?)
Greetings,

1) Z4 = sum(val for val in XS) is the same as Z4 = sum(XS)

2) class Foo() can also be written as class Foo:

3) With Foo.x you are using the class just to associate some variables
with a name.

What is the purpose of the script / what are you trying to do?

Abdur-Rahmaan Janhangeer
http://www.pythonmembers.club | https://github.com/Abdur-rahmaanJ
Mauritius
Re: Ask for help about class variable scope (Re: Why doesn't a dictionary work in classes?)
On Thu, Dec 27, 2018 at 1:56 PM jf...@ms4.hinet.net wrote:
>
> I saw the code below at stackoverflow. I have a little idea about the
> scope of a class, and about list comprehensions and generator
> expressions, but still can't figure out why Z4 works and Z5 does not.
> Can someone explain it? (in a not-too-complicated way:-)
>
> class Foo():
>     XS = [15, 15, 15, 15]
>     Z4 = sum(val for val in XS)
>     try:
>         Z5 = sum(XS[i] for i in range(len(XS)))
>     except NameError:
>         Z5 = None
>
> print(Foo.Z4, Foo.Z5)
> >>> 60 None

Class scope is special, and a generator expression within that class
scope is special too. There have been proposals to make these kinds of
things less special, but the most important thing to remember is that
when you create a generator expression, it is actually a function.
Remember that a function inside a class statement becomes a method, and
that inside the method, you have to use "self.X" rather than just "X"
to reference class attributes. That's what's happening here.

ChrisA
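To make the analogy concrete, here is a small self-contained
illustration; the class Bar and its attribute names are invented for
the example and are not from the thread.

class Bar:
    X = 10

    # Inside a method (a plain function in the class body), a bare "X"
    # does not resolve to the class attribute; you need Bar.X or self.X.
    def get(self):
        return Bar.X  # a bare "return X" would raise NameError at call time

    # The generator expression's body is a hidden function too, so the
    # bare "X" inside it cannot see the class namespace either...
    try:
        total = sum(X for _ in range(3))
    except NameError:
        total = "NameError"

    # ...but the outermost iterable is evaluated eagerly in the class
    # body, which is why expressions like Z4 above work.
    total_ok = sum(val for val in [X, X, X])

print(Bar().get())   # 10
print(Bar.total)     # NameError
print(Bar.total_ok)  # 30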
Re: Ask for help about class variable scope (Re: Why doesn't a dictionary work in classes?)
On 12/26/18, jf...@ms4.hinet.net wrote:
> I saw the code below at stackoverflow. I have a little idea about the
> scope of a class, and about list comprehensions and generator
> expressions, but still can't figure out why Z4 works and Z5 does not.
> Can someone explain it? (in a not-too-complicated way:-)
>
> class Foo():
>     XS = [15, 15, 15, 15]
>     Z4 = sum(val for val in XS)
>     try:
>         Z5 = sum(XS[i] for i in range(len(XS)))
>     except NameError:
>         Z5 = None
>
> print(Foo.Z4, Foo.Z5)
> >>> 60 None

Maybe rewriting it with approximately equivalent inline code and
generator functions will clarify the difference:

class Foo:
    def genexpr1(iterable):
        for val in iterable:
            yield val

    def genexpr2(iterable):
        for i in iterable:
            yield XS[i]

    XS = [15, 15, 15, 15]
    Z4 = sum(genexpr1(XS))
    try:
        Z5 = sum(genexpr2(range(len(XS))))
    except NameError:
        Z5 = None

    del genexpr1, genexpr2

>>> print(Foo.Z4, Foo.Z5)
60 None

In both cases, an iterable is passed to the generator function. This
argument is evaluated in the calling scope (e.g. range(len(XS))). A
generator expression has a similar implementation, except it also
evaluates the iterator for the iterable to ensure an exception is
raised immediately in the defining scope if it's not iterable. For
example:

>>> (x for x in 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not iterable

genexpr1 is working with local variables only, but genexpr2 has a
non-local reference to variable XS, which we call late binding. In this
case, when the generator code executes the first pass of the loop
(whenever that is), it looks for XS in the global (module) scope and
the builtins scope. It's not there, so a NameError is raised.

With late binding, the variable can get deleted or modified in the
source scope while the generator gets evaluated. For example:

>>> x = 'spam'
>>> g = (x[i] for i in range(len(x)))
>>> next(g)
's'
>>> del x
>>> next(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
NameError: name 'x' is not defined

>>> x = 'spam'
>>> g = (x[i] for i in range(len(x)))
>>> next(g)
's'
>>> x = 'eggs'
>>> list(g)
['g', 'g', 's']