Rate limiting a web crawler
Hi,

I want to build a simple web crawler. I know how I am going to do it
but I have one problem.

Obviously I don't want to negatively impact any of the websites that I
am crawling, so I want to implement some form of rate limiting of HTTP
requests to specific domain names.

What I'd like is some form of timer which calls a piece of code, say
every 5 seconds, and that code is what goes off and crawls the website.

I'm just not sure of the best way to call code based on a timer.

Could anyone offer some advice on the best way to do this? It will be
running on Linux, using the python-daemon library to run it as a
service, and will be using at least Python 3.6.

Thanks for any help.
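For illustration, a minimal per-domain rate limiter along these lines
might look like the sketch below. The 5-second delay, the polite_fetch
helper, and the use of urllib are assumptions for the example, not
part of the poster's design.

import time
from urllib.parse import urlparse
from urllib.request import urlopen

DELAY_SECONDS = 5.0   # assumed minimum gap between hits to the same domain
_last_hit = {}        # domain -> time.monotonic() of the most recent request

def polite_fetch(url):
    """Fetch a URL, sleeping first if its domain was hit too recently."""
    domain = urlparse(url).netloc
    wait = DELAY_SECONDS - (time.monotonic() - _last_hit.get(domain, float("-inf")))
    if wait > 0:
        time.sleep(wait)
    _last_hit[domain] = time.monotonic()
    with urlopen(url) as resp:
        return resp.read()

A daemon loop could then simply call polite_fetch(url) for each URL in
its queue and let the per-domain bookkeeping enforce the delay.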
Re: Rate limiting a web crawler
On 12/26/18 10:35 AM, Simon Connah wrote:
> Hi,
>
> I want to build a simple web crawler. I know how I am going to do it
> but I have one problem.
>
> Obviously I don't want to negatively impact any of the websites that I
> am crawling, so I want to implement some form of rate limiting of HTTP
> requests to specific domain names.
>
> What I'd like is some form of timer which calls a piece of code, say
> every 5 seconds, and that code is what goes off and crawls the website.
>
> I'm just not sure of the best way to call code based on a timer.
>
> Could anyone offer some advice on the best way to do this? It will be
> running on Linux, using the python-daemon library to run it as a
> service, and will be using at least Python 3.6.
>
> Thanks for any help.

One big piece of information that would help in replies would be an
indication of scale. Is your application crawling just a few sites, so
that you need to pause between accesses to keep the hit rate down, or
are you crawling a number of sites, so that if you are going to delay
crawling a page from one site, you can go off and crawl another in the
meantime?

--
Richard Damon
Re: Rate limiting a web crawler
On 12/26/2018 10:35 AM, Simon Connah wrote:
> Hi,
>
> I want to build a simple web crawler. I know how I am going to do it
> but I have one problem.
>
> Obviously I don't want to negatively impact any of the websites that I
> am crawling, so I want to implement some form of rate limiting of HTTP
> requests to specific domain names.
>
> What I'd like is some form of timer which calls a piece of code, say
> every 5 seconds, and that code is what goes off and crawls the website.
>
> I'm just not sure of the best way to call code based on a timer.
>
> Could anyone offer some advice on the best way to do this? It will be
> running on Linux, using the python-daemon library to run it as a
> service, and will be using at least Python 3.6.

You can use asyncio to make repeated non-blocking requests to a web
site at timed intervals and to work with multiple websites at once.

You can do the same with tkinter, except that requests would block
until a response arrives unless you implemented your own polling.

--
Terry Jan Reedy
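A rough sketch of the asyncio approach described here: one task per
site, each sleeping between requests so that no single domain is hit
too often. The 5-second interval, the example URLs, and running the
blocking fetch in a thread pool are assumptions for illustration.

import asyncio
import urllib.request

SITES = ["https://example.com/", "https://example.org/"]  # placeholder URLs
INTERVAL = 5.0  # assumed per-domain delay in seconds

def fetch(url):
    # Blocking fetch, pushed onto a worker thread so the event loop stays free.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

async def crawl_site(url, loop):
    while True:
        body = await loop.run_in_executor(None, fetch, url)
        print(url, len(body), "bytes")
        await asyncio.sleep(INTERVAL)  # rate limit for this one domain

def main():
    # asyncio.run() needs 3.7+, so use the 3.6-compatible spelling here.
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*(crawl_site(u, loop) for u in SITES)))

if __name__ == "__main__":
    main()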
Re: Rate limiting a web crawler
On 26/12/2018 18:30, Richard Damon wrote:
> On 12/26/18 10:35 AM, Simon Connah wrote:
>> Hi,
>>
>> I want to build a simple web crawler. I know how I am going to do it
>> but I have one problem.
>>
>> Obviously I don't want to negatively impact any of the websites that I
>> am crawling, so I want to implement some form of rate limiting of HTTP
>> requests to specific domain names.
>>
>> What I'd like is some form of timer which calls a piece of code, say
>> every 5 seconds, and that code is what goes off and crawls the website.
>>
>> I'm just not sure of the best way to call code based on a timer.
>>
>> Could anyone offer some advice on the best way to do this? It will be
>> running on Linux, using the python-daemon library to run it as a
>> service, and will be using at least Python 3.6.
>>
>> Thanks for any help.
>
> One big piece of information that would help in replies would be an
> indication of scale. Is your application crawling just a few sites, so
> that you need to pause between accesses to keep the hit rate down, or
> are you crawling a number of sites, so that if you are going to delay
> crawling a page from one site, you can go off and crawl another in the
> meantime?

Sorry, I should have stated that. This is for a minimum viable product,
so crawling say two or three domain names would be enough to start
with, but I'd want to grow in the future.

I'm building this on AWS, and my idea was to have each web crawler
instance query a database (DynamoDB), get say 10 URLs, and recrawl
them if they hadn't been crawled in the previous 12 to 24 hours. If
they have been crawled in the last 12 to 24 hours, skip that URL. Once
a URL has been crawled I would then save the crawl date and time in
the database.

Doing it that way I could skip the whole timing thing on the daemon end
and just use database queries to control whether a URL is crawled or
not. Of course, that would mean that one web crawler would have to
"lock" a domain name so that multiple instances do not query the same
domain name in parallel, which would be bad.
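Since the message mentions DynamoDB, here is one way the per-domain
"lock" could be sketched with a conditional write. The boto3 calls are
standard, but the table name, attribute names, and lease length are
hypothetical and not part of the poster's design.

import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
locks = dynamodb.Table("crawler_domain_locks")  # hypothetical table, PK "domain"

LEASE_SECONDS = 300  # assumed maximum time one crawler may hold a domain

def acquire_domain(domain):
    """Try to claim a domain; only one crawler at a time should succeed."""
    now = int(time.time())
    try:
        locks.put_item(
            Item={"domain": domain, "expires_at": now + LEASE_SECONDS},
            # Succeed only if no lock exists or the previous lease has expired.
            ConditionExpression="attribute_not_exists(#d) OR expires_at < :now",
            ExpressionAttributeNames={"#d": "domain"},
            ExpressionAttributeValues={":now": now},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another crawler currently holds this domain
        raise

def release_domain(domain):
    locks.delete_item(Key={"domain": domain})

A crawler instance would call acquire_domain() before fetching any URLs
for that domain and release_domain() when it is done, so two instances
never hit the same site in parallel.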
Re: Rate limiting a web crawler
On 26/12/2018 19:04, Terry Reedy wrote:
> On 12/26/2018 10:35 AM, Simon Connah wrote:
>> Hi,
>>
>> I want to build a simple web crawler. I know how I am going to do it
>> but I have one problem.
>>
>> Obviously I don't want to negatively impact any of the websites that I
>> am crawling, so I want to implement some form of rate limiting of HTTP
>> requests to specific domain names.
>>
>> What I'd like is some form of timer which calls a piece of code, say
>> every 5 seconds, and that code is what goes off and crawls the website.
>>
>> I'm just not sure of the best way to call code based on a timer.
>>
>> Could anyone offer some advice on the best way to do this? It will be
>> running on Linux, using the python-daemon library to run it as a
>> service, and will be using at least Python 3.6.
>
> You can use asyncio to make repeated non-blocking requests to a web
> site at timed intervals and to work with multiple websites at once.
>
> You can do the same with tkinter, except that requests would block
> until a response arrives unless you implemented your own polling.

Thank you. I'll look into asyncio.
Pycharm issue with import ssl
While trying to run Python code in PyCharm 2018.3.2 I am getting the
error below. Can someone help?

Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\pydev\pydevconsole.py", line 5, in <module>
    from _pydev_comm.rpc import make_rpc_client, start_rpc_server, start_rpc_server_and_make_client
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\pydev\_pydev_comm\rpc.py", line 4, in <module>
    from _pydev_comm.server import TSingleThreadedServer
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\pydev\_pydev_comm\server.py", line 4, in <module>
    from _shaded_thriftpy.server import TServer
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\third_party\thriftpy\_shaded_thriftpy\server.py", line 9, in <module>
    from _shaded_thriftpy.transport import (
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\third_party\thriftpy\_shaded_thriftpy\transport\__init__.py", line 57, in <module>
    from .sslsocket import TSSLSocket, TSSLServerSocket  # noqa
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.3.2\helpers\third_party\thriftpy\_shaded_thriftpy\transport\sslsocket.py", line 7, in <module>
    import ssl
  File "C:\Users\grajendran\Anaconda3\lib\ssl.py", line 98, in <module>
    import _ssl             # if we can't import it, let the error propagate
ImportError: DLL load failed: The specified module could not be found.

Process finished with exit code 1
Ask for help about class variable scope (Re: Why doesn't a dictionary work in classes?)
I saw the code below at stackoverflow. I have a little idea about the
scope of a class, and about list comprehensions and generator
expressions, but still can't figure out why Z4 works and Z5 does not.
Can someone explain it? (in a not-too-complicated way:-)

class Foo():
    XS = [15, 15, 15, 15]
    Z4 = sum(val for val in XS)
    try:
        Z5 = sum(XS[i] for i in range(len(XS)))
    except NameError:
        Z5 = None

print(Foo.Z4, Foo.Z5)
>>> 60 None

--Jach
Re: Ask for help about class variable scope (Re: Why doesn't a dictionary work in classes?)
Greetings,

1) Z4 = sum(val for val in XS) is the same as Z4 = sum(XS)

2) class Foo() can also be written as class Foo:

3) With Foo.x you are using the class just to associate some variables
with a name.

What is the purpose of the script / what are you trying to do?

Abdur-Rahmaan Janhangeer
http://www.pythonmembers.club | https://github.com/Abdur-rahmaanJ
Mauritius
Re: Ask for help about class variable scope (Re: Why doesn't a dictionary work in classes?)
On Thu, Dec 27, 2018 at 1:56 PM jf...@ms4.hinet.net wrote:
>
> I saw the code below at stackoverflow. I have a little idea about the
> scope of a class, and about list comprehensions and generator
> expressions, but still can't figure out why Z4 works and Z5 does not.
> Can someone explain it? (in a not-too-complicated way:-)
>
> class Foo():
>     XS = [15, 15, 15, 15]
>     Z4 = sum(val for val in XS)
>     try:
>         Z5 = sum(XS[i] for i in range(len(XS)))
>     except NameError:
>         Z5 = None
>
> print(Foo.Z4, Foo.Z5)
> >>> 60 None

Class scope is special, and a generator expression within that class
scope is special too. There have been proposals to make these kinds of
things less special, but the most important thing to remember is that
when you create a generator expression, it is actually a function.
Remember that a function inside a class statement becomes a method, and
that inside the method, you have to use "self.X" rather than just "X"
to reference class attributes. That's what's happening here.

ChrisA
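To make the analogy concrete, here is a small self-contained
illustration; the class Bar and its attribute names are invented for
the example and are not from the thread.

class Bar:
    X = 10

    # Inside a method (a plain function in the class body), a bare "X"
    # does not resolve to the class attribute; you need Bar.X or self.X.
    def get(self):
        return Bar.X  # a bare "return X" would raise NameError at call time

    # The generator expression's body is a hidden function too, so the
    # bare "X" inside it cannot see the class namespace either...
    try:
        total = sum(X for _ in range(3))
    except NameError:
        total = "NameError"

    # ...but the outermost iterable is evaluated eagerly in the class
    # body, which is why expressions like Z4 above work.
    total_ok = sum(val for val in [X, X, X])

print(Bar().get())   # 10
print(Bar.total)     # NameError
print(Bar.total_ok)  # 30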
Re: Ask for help about class variable scope (Re: Why doesn't a dictionary work in classes?)
On 12/26/18, jf...@ms4.hinet.net wrote:
> I saw the code below at stackoverflow. I have a little idea about the
> scope of a class, and about list comprehensions and generator
> expressions, but still can't figure out why Z4 works and Z5 does not.
> Can someone explain it? (in a not-too-complicated way:-)
>
> class Foo():
>     XS = [15, 15, 15, 15]
>     Z4 = sum(val for val in XS)
>     try:
>         Z5 = sum(XS[i] for i in range(len(XS)))
>     except NameError:
>         Z5 = None
>
> print(Foo.Z4, Foo.Z5)
> >>> 60 None

Maybe rewriting it with approximately equivalent inline code and
generator functions will clarify the difference:

class Foo:
    def genexpr1(iterable):
        for val in iterable:
            yield val

    def genexpr2(iterable):
        for i in iterable:
            yield XS[i]

    XS = [15, 15, 15, 15]
    Z4 = sum(genexpr1(XS))
    try:
        Z5 = sum(genexpr2(range(len(XS))))
    except NameError:
        Z5 = None

    del genexpr1, genexpr2

>>> print(Foo.Z4, Foo.Z5)
60 None

In both cases, an iterable is passed to the generator function. This
argument is evaluated in the calling scope (e.g. range(len(XS))). A
generator expression has a similar implementation, except it also
evaluates the iterator for the iterable to ensure an exception is
raised immediately in the defining scope if it's not iterable. For
example:

>>> (x for x in 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not iterable

genexpr1 is working with local variables only, but genexpr2 has a
non-local reference to variable XS, which we call late binding. In this
case, when the generator code executes the first pass of the loop
(whenever that is), it looks for XS in the global (module) scope and
the builtins scope. It's not there, so a NameError is raised.

With late binding, the variable can get deleted or modified in the
source scope while the generator gets evaluated. For example:

>>> x = 'spam'
>>> g = (x[i] for i in range(len(x)))
>>> next(g)
's'
>>> del x
>>> next(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
NameError: name 'x' is not defined

>>> x = 'spam'
>>> g = (x[i] for i in range(len(x)))
>>> next(g)
's'
>>> x = 'eggs'
>>> list(g)
['g', 'g', 's']