BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list

Greetings,

I have a Python 3.6 script on Windows that scrapes comment history from a 
website. It's currently set up this way:


Requestor (threads) -> list -> Parser (threads) -> queue -> CSVWriter 
(single thread)


It takes 15 minutes to process ~11,000 comments.

When I replaced the list between the Requestor and Parser with a queue 
to speed things up, BeautifulSoup stopped working.


When I changed BeautifulSoup(contents, "lxml") to 
BeautifulSoup(contents), I got the UserWarning that no parser was 
explicitly specified, with a reference to line 80 in threading.py (which 
puts it in the RLock factory function).


When I switched back to using a list between the Requestor and Parser, 
the Parser worked again.


BeautifulSoup doesn't work with a threaded input queue?

Thank you,

Chris Reimer

--
https://mail.python.org/mailman/listinfo/python-list


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Peter Otten
Christopher Reimer via Python-list wrote:

> Greetings,
> 
> I have a Python 3.6 script on Windows that scrapes comment history from a
> website. It's currently set up this way:
> 
> Requestor (threads) -> list -> Parser (threads) -> queue -> CSVWriter
> (single thread)
> 
> It takes 15 minutes to process ~11,000 comments.
> 
> When I replaced the list between the Requestor and Parser with a queue
> to speed things up, BeautifulSoup stopped working.
> 
> When I changed BeautifulSoup(contents, "lxml") to
> BeautifulSoup(contents), I got the UserWarning that no parser was
> explicitly specified, with a reference to line 80 in threading.py (which
> puts it in the RLock factory function).
> 
> When I switched back to using a list between the Requestor and Parser,
> the Parser worked again.
> 
> BeautifulSoup doesn't work with a threaded input queue?

The documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small 
code sample would be ideal...



Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list

On 8/27/2017 11:54 AM, Peter Otten wrote:


The documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small
code sample would be ideal.


A worker thread uses a request object to get the page and puts it into 
the queue as page.content (HTML). Another worker thread gets the 
page.content from the queue, applies BeautifulSoup, and nothing happens.


soup = BeautifulSoup(page_content, 'lxml')
print(soup)

No output whatsoever. If I remove 'lxml', I get the UserWarning that no 
parser was explicitly specified and a reference to threading.py at 
line 80.


I verified that the page.content that goes into and out of the queue is 
the same page.content that goes into and out of a list.


I read somewhere that BeautifulSoup may not be thread-safe. I've never 
had a problem with threads storing the output into a queue. Using a 
queue (random order) instead of a list (sequential order) to feed pages 
as input is making it wonky.


Chris R.


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread MRAB

On 2017-08-27 20:35, Christopher Reimer via Python-list wrote:

On 8/27/2017 11:54 AM, Peter Otten wrote:


The documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

says you can make the BeautifulSoup object from a string or file.
Can you give a few more details where the queue comes into play? A small
code sample would be ideal.


A worker thread uses a request object to get the page and puts it into
the queue as page.content (HTML). Another worker thread gets the
page.content from the queue, applies BeautifulSoup, and nothing happens.

soup = BeautifulSoup(page_content, 'lxml')
print(soup)

No output whatsoever. If I remove 'lxml', I get the UserWarning that no
parser was explicitly specified and a reference to threading.py at
line 80.

I verified that the page.content that goes into and out of the queue is
the same page.content that goes into and out of a list.

I read somewhere that BeautifulSoup may not be thread-safe. I've never
had a problem with threads storing the output into a queue. Using a
queue (random order) instead of a list (sequential order) to feed pages
as input is making it wonky.

What do you mean by "queue (random order)"? A queue is sequential order, 
first-in-first-out.



Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Peter Otten
Christopher Reimer via Python-list wrote:

> On 8/27/2017 11:54 AM, Peter Otten wrote:
> 
>> The documentation
>>
>> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
>>
>> says you can make the BeautifulSoup object from a string or file.
>> Can you give a few more details where the queue comes into play? A small
>> code sample would be ideal.
> 
> A worker thread uses a request object to get the page and puts it into
> the queue as page.content (HTML). Another worker thread gets the
> page.content from the queue, applies BeautifulSoup, and nothing happens.
> 
> soup = BeautifulSoup(page_content, 'lxml')
> print(soup)
> 
> No output whatsoever. If I remove 'lxml', I get the UserWarning that no
> parser was explicitly specified and a reference to threading.py at
> line 80.
> 
> I verified that the page.content that goes into and out of the queue is
> the same page.content that goes into and out of a list.
> 
> I read somewhere that BeautifulSoup may not be thread-safe. I've never
> had a problem with threads storing the output into a queue. Using a
> queue (random order) instead of a list (sequential order) to feed pages
> as input is making it wonky.

Here's a simple example that extracts titles from generated html. It seems 
to work. Does it resemble what you do?

import csv
import threading
from queue import Queue

import bs4


def process_html(source, dest, index):
    while True:
        html = source.get()
        if html is DONE:
            dest.put(DONE)
            break
        soup = bs4.BeautifulSoup(html, "lxml")
        dest.put(soup.find("title").text)


def write_csv(source, filename, to_go):
    with open(filename, "w") as f:
        writer = csv.writer(f)
        while True:
            title = source.get()
            if title is DONE:
                to_go -= 1
                if not to_go:
                    return
            else:
                writer.writerow([title])


NUM_SOUP_THREADS = 10
DONE = object()

web_to_soup = Queue()
soup_to_file = Queue()

soup_threads = [
    threading.Thread(target=process_html, args=(web_to_soup, soup_to_file, i))
    for i in range(NUM_SOUP_THREADS)
]

write_thread = threading.Thread(
    target=write_csv, args=(soup_to_file, "tmp.csv", NUM_SOUP_THREADS),
)

write_thread.start()

for thread in soup_threads:
    thread.start()

for i in range(100):
    web_to_soup.put("<title>#{}</title>".format(i))
for i in range(NUM_SOUP_THREADS):
    web_to_soup.put(DONE)

for t in soup_threads:
    t.join()
write_thread.join()




Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list

On 8/27/2017 1:12 PM, MRAB wrote:

What do you mean by "queue (random order)"? A queue is sequential 
order, first-in-first-out. 


With 20 threads requesting 20 different pages, they're not going into 
the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19), and they 
come in at different times for the parser worker threads to pick up for 
processing.


Similar situation with a list, but I sort the list before giving it to 
the parser, so all the items are in sequential order and fed to the 
parser one at a time.


Chris R.



Need advice on writing better test cases.

2017-08-27 Thread Anubhav Yadav
Hello, 
I am a (self-taught) Python developer and I write a lot of Python code 
every day. I try to do as much unit testing as possible, but I want to be 
better at it. I want to write more test cases, especially ones that rely on 
database insertions and reads and on file IO. Here are my use cases for 
testing. 
How do I test whether things are going into the database properly or not 
(mysql/mongo)? I want to be able to create a test database environment as 
simply as possible, and create and delete the test environment before each 
functional test case is run. 
Sometimes I write code that reads data from some rabbitmq queue and does 
certain things. How can I write an end-to-end functional test that creates 
a test rabbitmq environment (exchanges and queues) -> waits for some time -> 
sees if the intended work has been done -> deletes the test environment? 
I want to make sure that any new commit on my self-hosted gitlab server 
runs all functional test cases before the merge is accepted. 
Since we use a lot of docker here to deploy modules to production, I want 
to write functional test cases that test the whole system and see whether 
things are happening the way they are supposed to. This means firing up a 
lot of docker containers and test databases with some data, and running all 
the test cases from an end user's point of view. 
Can you suggest the right Python testing frameworks I should be using? 
Right now I am using unittest to write test cases and manual if/else 
statements to run the functional test cases. 
I try to create rabbitmq queues and bind them to rabbitmq exchanges using 
the pika module. I then run the module using python -m moduleName and sleep 
for some time. Then I kill the process (subprocess) and see whether the 
intended consequences have happened or not. It's a pain in the ass to be 
doing so many things for test cases. I clearly need to learn how to do 
things better. 
Any suggestion/book/article/course/video will help me immensely in writing 
better test cases.
Thanks for reading.


Re: Express thanks

2017-08-27 Thread Abdur-Rahmaan Janhangeer
hi,

liking py, i follow py discuss at pretty some places,

i can say that upto now, py mailing lists are awesome

just make a drop on irc ...

Keep it up guys !

Abdur-Rahmaan Janhangeer,
Mauritius
abdurrahmaanjanhangeer.wordpress.com

On 21 Aug 2017 18:38, "Hamish MacDonald"  wrote:

I wanted to give a shout out to the wonderfully passionate contributions to
python I've witnessed following this and other mailing lists over the
last little bit.

The level of knowledge and willingness to help I've seen are truly
inspiring. Super motivating.

Probably the wrong forum for such a message but what the hey.

Hamish


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread MRAB

On 2017-08-27 21:35, Christopher Reimer via Python-list wrote:

On 8/27/2017 1:12 PM, MRAB wrote:

What do you mean by "queue (random order)"? A queue is sequential 
order, first-in-first-out. 


With 20 threads requesting 20 different pages, they're not going into
the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19), and they
come in at different times for the parser worker threads to pick up for
processing.

Similar situation with a list, but I sort the list before giving it to
the parser, so all the items are in sequential order and fed to the
parser one at a time.

What if you don't sort the list? I ask because it sounds like you're 
changing 2 variables (i.e. list->queue, sorted->unsorted) at the same 
time, so you can't be sure that it's the queue that's the problem.



Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list

On 8/27/2017 1:31 PM, Peter Otten wrote:


Here's a simple example that extracts titles from generated html. It seems
to work. Does it resemble what you do?

Your example is similar to my code when I'm using a list for the input 
to the parser. You have soup_threads and write_threads, but no read_threads.


The particular website I'm scraping requires checking each page for the 
sentinel value (i.e., "Sorry, no more comments") in order to determine 
when to stop requesting pages. For my comment history that's ~750 pages 
to parse ~11,000 comments.


I have 20 read_threads requesting pages and putting them into the output 
queue, which is the input_queue for the parser. My soup_threads can get 
items from the queue, but BeautifulSoup doesn't do anything after that.


Chris R.


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list

On 8/27/2017 1:50 PM, MRAB wrote:
What if you don't sort the list? I ask because it sounds like you're 
changing 2 variables (i.e. list->queue, sorted->unsorted) at the same 
time, so you can't be sure that it's the queue that's the problem.


If I'm using a list, I'm using a for loop to input items into the parser.

If I'm using a queue, I'm using worker threads to put or get items.

The item is still the same whether in a list or a queue.

Chris R.


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Paul Rubin
Christopher Reimer  writes:
> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser. 

Given how slow parsing is, you probably want to save the scraped pages to
disk files, and then run the parser in parallel processes that read from
the disk.  You could also use something like Redis (redis.io) as a queue.
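
That two-stage pipeline can be sketched as follows. This is a hedged illustration, not the original code: the helper names (`save_page`, `extract_title`, `parse_all`) are invented, and a regex stands in for the real BeautifulSoup call so the sketch stays self-contained.

```python
import glob
import os
import re
import tempfile
from multiprocessing import Pool


def save_page(directory, index, html):
    # Stage 1: a fetch thread dumps raw HTML to disk instead of a queue.
    path = os.path.join(directory, "page_{:04d}.html".format(index))
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path


def extract_title(path):
    # Stage 2 worker: stand-in for BeautifulSoup(f.read(), "lxml").title.text
    with open(path, encoding="utf-8") as f:
        match = re.search(r"<title>(.*?)</title>", f.read())
    return match.group(1) if match else ""


def parse_all(directory):
    # Parse the saved pages in parallel worker *processes*, which
    # sidesteps the GIL for the CPU-bound parsing step.
    paths = sorted(glob.glob(os.path.join(directory, "*.html")))
    with Pool() as pool:
        return pool.map(extract_title, paths)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(3):
            save_page(tmp, i, "<html><title>page {}</title></html>".format(i))
        return parse_all(tmp)


if __name__ == "__main__":
    print(demo())
```

A side benefit of staging through disk: an interrupted run can resume parsing without re-fetching anything.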


Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Peter Otten
Christopher Reimer via Python-list wrote:

> On 8/27/2017 1:31 PM, Peter Otten wrote:
> 
>> Here's a simple example that extracts titles from generated html. It
>> seems to work. Does it resemble what you do?
> Your example is similar to my code when I'm using a list for the input
> to the parser. You have soup_threads and write_threads, but no
> read_threads.
> 
> The particular website I'm scraping requires checking each page for the
> sentinel value (i.e., "Sorry, no more comments") in order to determine
> when to stop requesting pages. 

Where's that check happening? If it's in the soup thread you need some kind 
of back channel to the read threads to inform them that you need no more 
pages.
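
A minimal sketch of such a back channel, with all names invented for illustration: a shared `threading.Event` that the soup thread sets when it sees the sentinel page, and that each read thread checks before fetching the next page. `fake_fetch` stands in for the real requests call.

```python
import threading
from queue import Queue

stop = threading.Event()   # the back channel: parser -> readers
pages = Queue()
results = []


def fake_fetch(n):
    # Stand-in for requests.get(); pages from #3 on carry the sentinel text.
    return "Sorry, no more comments" if n >= 3 else "comment page {}".format(n)


def read_worker(numbers):
    for n in numbers:
        if stop.is_set():      # parser said no more pages are needed
            break
        pages.put(fake_fetch(n))
    pages.put(None)            # this reader is finished


def soup_worker(reader_count):
    done = 0
    while done < reader_count:
        html = pages.get()
        if html is None:
            done += 1
        elif "Sorry, no more comments" in html:
            stop.set()         # sentinel seen: signal the readers to stop
        else:
            results.append(html)   # stand-in for BeautifulSoup parsing


reader = threading.Thread(target=read_worker, args=(range(10),))
parser = threading.Thread(target=soup_worker, args=(1,))
reader.start()
parser.start()
reader.join()
parser.join()
```

With 20 readers you would start 20 such threads and pass reader_count=20; note the Event only suppresses fetches that haven't started yet, so a few extra pages may still be requested after the sentinel is seen.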
 
> For my comment history that's ~750 pages
> to parse ~11,000 comments.
> 
> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser. My soup_threads can get
> items from the queue, but BeautifulSoup doesn't do anything after that.
> 
> Chris R.




Re: BeautifulSoup doesn't work with a threaded input queue?

2017-08-27 Thread Christopher Reimer via Python-list
Ah, shoot me. I had a .join() statement on the output queue but not on 
the input queue. So the threads for the input queue got terminated 
before BeautifulSoup could get started. I went down that same rabbit 
hole with CSVWriter the other day. *sigh*
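
For the record, a minimal sketch of the pattern that was missing (worker names invented; `html.upper()` stands in for the BeautifulSoup call): each consumer marks items with `task_done()`, and the main thread calls `join()` on the *input* queue so daemon workers aren't torn down before they finish.

```python
import threading
from queue import Queue

in_q = Queue()
out_q = Queue()


def parse_worker():
    # Daemon consumer, standing in for a soup thread.
    while True:
        html = in_q.get()
        out_q.put(html.upper())   # stand-in for BeautifulSoup(html, "lxml")
        in_q.task_done()          # lets in_q.join() see this item as finished


threading.Thread(target=parse_worker, daemon=True).start()

for page in ["<p>a</p>", "<p>b</p>"]:
    in_q.put(page)

in_q.join()   # the missing piece: block until every queued item is processed
results = [out_q.get() for _ in range(2)]
```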


Thanks for everyone's help.

Chris R.


Re: Need advice on writing better test cases.

2017-08-27 Thread Ben Finney
Anubhav Yadav  writes:

> I want to write more test cases, specially that rely on database
> insertions and reads and file IO.

Thanks for taking seriously the importance of test cases for your code!

One important thing to recognise is that a unit test is only one type of
test. It tests one unit of code, typically a function, and should assert
exactly one clearly true-or-false result of calling that code unit.

If you have a function and you want to assert *that function's*
behaviour, you can avoid external dependencies during the test run by
providing fake resources. These can be mocks (e.g. with ‘unittest.mock’)
or other fake resources that are going to behave exactly how you want,
for the purpose of testing the code unit.
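
For instance, here is a hedged sketch with `unittest.mock`; the `save_user` function and its `db` collaborator are invented for illustration. The Mock replaces the real database client, so the test asserts only on what this one code unit does.

```python
from unittest import mock


def save_user(db, name):
    # Unit under test: validate input, then delegate persistence to `db`.
    if not name:
        raise ValueError("name required")
    db.insert("users", {"name": name})


# Fake resource: no MySQL/Mongo connection is ever opened.
db = mock.Mock()
save_user(db, "anubhav")
db.insert.assert_called_once_with("users", {"name": "anubhav"})
```

A second test case can assert that an empty name raises ValueError before the database is ever touched.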

Unit test cases:

* Exercise a small unit of code in isolation.
* Each tests exactly one obvious behaviour of the code unit.
* Aim to have exactly one reason the test case can fail.

Because they are isolated and test a small code unit, they are typically
*fast* and can be run very often, because the entire unit test suite
completes in seconds.

> How to test if things are going into the database properly or not?

That is *not* a unit test; it is a test that one part of your code
has the right effect on some other part of the system. This meets the
description not of a unit test but of an integration test.

These integration tests, because they will likely be a lot slower than
your unit tests, should be in a separate suite of integration tests, to
be run when the time is available to run them.

Integration tests:

* Exercise many code units together.
* Typically make an assertion about the *resulting state* of many
  underlying actions.
* Can have many things that can cause the test case to fail.

> (mysql/mongo). I want to be able to create a test database environment
> as simple as possible. Create and delete the test environment before
> each functional test case is run.

One good article discussing how to make integration tests, specifically
for database integration with your app, is this one:
<https://julien.danjou.info/blog/2014/db-integration-testing-strategies-python>.

I hope that helps.

-- 
 \ “The enjoyment of one's tools is an essential ingredient of |
  `\ successful work.” —Donald Knuth, _The Art of Computer |
_o__) Programming_ |
Ben Finney
