BeautifulSoup doesn't work with a threaded input queue?
Greetings,

I have a Python 3.6 script on Windows to scrape comment history from a website. It's currently set up this way:

    Requestor (threads) -> list -> Parser (threads) -> queue -> CSVWriter (single thread)

It takes 15 minutes to process ~11,000 comments.

When I replaced the list with a queue between the Requestor and Parser to speed things up, BeautifulSoup stopped working.

When I changed BeautifulSoup(contents, "lxml") to BeautifulSoup(contents), I got the UserWarning that no parser was explicitly set, and a reference to line 80 in threading.py (which puts it in the RLock factory function).

When I switched back to using a list between the Requestor and Parser, the Parser worked again.

BeautifulSoup doesn't work with a threaded input queue?

Thank you,

Chris Reimer
Re: BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer via Python-list wrote:

> When I replaced the list with a queue between the Requestor and Parser
> to speed things up, BeautifulSoup stopped working.
> [...]
> BeautifulSoup doesn't work with a threaded input queue?

The documentation

    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup

says you can make the BeautifulSoup object from a string or file.

Can you give a few more details where the queue comes into play? A small code sample would be ideal...
Re: BeautifulSoup doesn't work with a threaded input queue?
On 8/27/2017 11:54 AM, Peter Otten wrote:

> The documentation
>
>     https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
>
> says you can make the BeautifulSoup object from a string or file.
>
> Can you give a few more details where the queue comes into play? A small
> code sample would be ideal.

A worker thread uses a request object to get the page and puts it into the queue as page.content (HTML). Another worker thread gets the page.content from the queue to apply BeautifulSoup, and nothing happens.

    soup = BeautifulSoup(page_content, 'lxml')
    print(soup)

No output whatsoever. If I remove 'lxml', I get the UserWarning that no parser was explicitly set and the reference to threading.py at line 80.

I verified that the page.content that goes into and out of the queue is the same page.content that goes into and out of a list.

I read somewhere that BeautifulSoup may not be thread-safe. I've never had a problem with threads storing the output into a queue. Using a queue (random order) instead of a list (sequential order) to feed pages for the input is making it wonky.

Chris R.
Re: BeautifulSoup doesn't work with a threaded input queue?
On 2017-08-27 20:35, Christopher Reimer via Python-list wrote:

> A worker thread uses a request object to get the page and puts it into
> the queue as page.content (HTML). Another worker thread gets the
> page.content from the queue to apply BeautifulSoup, and nothing happens.
> [...]
> Using a queue (random order) instead of a list (sequential order) to
> feed pages for the input is making it wonky.

What do you mean by "queue (random order)"? A queue is sequential order, first-in-first-out.
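For what it's worth, a minimal demonstration of that FIFO behaviour (illustrative only, not from the script under discussion):

    from queue import Queue

    q = Queue()
    for i in range(5):
        q.put(i)  # items go in as 0, 1, 2, 3, 4

    # items come out in exactly the order they went in
    print([q.get() for _ in range(5)])  # -> [0, 1, 2, 3, 4]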
Re: BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer via Python-list wrote:

> A worker thread uses a request object to get the page and puts it into
> the queue as page.content (HTML). Another worker thread gets the
> page.content from the queue to apply BeautifulSoup, and nothing happens.
> [...]
> I read somewhere that BeautifulSoup may not be thread-safe. I've never
> had a problem with threads storing the output into a queue. Using a
> queue (random order) instead of a list (sequential order) to feed pages
> for the input is making it wonky.

Here's a simple example that extracts titles from generated html. It seems to work. Does it resemble what you do?

    import csv
    import threading
    import time
    from queue import Queue

    import bs4


    def process_html(source, dest, index):
        while True:
            html = source.get()
            if html is DONE:
                dest.put(DONE)
                break
            soup = bs4.BeautifulSoup(html, "lxml")
            dest.put(soup.find("title").text)


    def write_csv(source, filename, to_go):
        with open(filename, "w") as f:
            writer = csv.writer(f)
            while True:
                title = source.get()
                if title is DONE:
                    to_go -= 1
                    if not to_go:
                        return
                else:
                    writer.writerow([title])


    NUM_SOUP_THREADS = 10
    DONE = object()

    web_to_soup = Queue()
    soup_to_file = Queue()

    soup_threads = [
        threading.Thread(
            target=process_html,
            args=(web_to_soup, soup_to_file, i),
        )
        for i in range(NUM_SOUP_THREADS)
    ]
    write_thread = threading.Thread(
        target=write_csv,
        args=(soup_to_file, "tmp.csv", NUM_SOUP_THREADS),
    )

    write_thread.start()
    for thread in soup_threads:
        thread.start()

    for i in range(100):
        web_to_soup.put("<title>#{}</title>".format(i))
    for i in range(NUM_SOUP_THREADS):
        web_to_soup.put(DONE)

    for t in soup_threads:
        t.join()
    write_thread.join()
Re: BeautifulSoup doesn't work with a threaded input queue?
On 8/27/2017 1:12 PM, MRAB wrote:

> What do you mean by "queue (random order)"? A queue is sequential
> order, first-in-first-out.

With 20 threads requesting 20 different pages, the pages aren't going into the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19); they arrive at different times for the parser worker threads to get for processing.

It's a similar situation with a list, but I sort the list before giving it to the parser, so all the items are in sequential order and fed to the parser one at a time.

Chris R.
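If ordering ever matters on the queue side, one option (not what the script above does) is a queue.PriorityQueue keyed on the page number, e.g.:

    from queue import PriorityQueue

    pq = PriorityQueue()

    # producers tag each page with its number
    pq.put((3, "page 3 html"))
    pq.put((1, "page 1 html"))
    pq.put((2, "page 2 html"))

    # a consumer always receives the lowest-numbered page *currently queued*
    # (a page that hasn't arrived yet can still be overtaken)
    while not pq.empty():
        number, html = pq.get()
        print(number)  # -> 1, 2, 3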
Need advice on writing better test cases.
Hello,

I am a (self-taught) Python developer and I write a lot of Python code every day. I try to do as much unit testing as possible, but I want to be better at it: I want to write more test cases, especially ones that rely on database insertions and reads and file IO.

Here are my use-cases for testing:

1. How do I test whether things are going into the database properly or not (mysql/mongo)? I want to be able to create a test database environment as simply as possible, and create and delete the test environment before each functional test case is run.

2. Sometimes I write code that reads some data from some rabbitmq queue and does certain things. How can I write an end-to-end functional test that creates a test rabbitmq environment (exchanges and queues) -> waits for some time -> sees if the intended work has been done -> deletes the test environment?

3. I want to make sure that any new commit on my self-hosted gitlab server first runs all functional test cases before accepting the merge.

4. Since we use a lot of docker here to deploy modules to production, I want to write functional test cases that test the whole system as a whole and see if things are happening the way they are supposed to happen or not. This means firing up a lot of docker containers and a lot of test databases with some data, and running all the test cases from an end-user point of view.

Can you suggest the right Python testing frameworks that I should be using? Right now I am using unittest to write test cases and manual if/else statements to run the functional test cases. I try to create rabbitmq queues and bind them to rabbitmq exchanges using the pika module. I then run the module using python -m moduleName and then sleep for some time. Then I kill the process (subprocess) and see if the intended consequences have happened or not.

It's a pain in the ass to be doing so many things for test cases. I clearly need to learn how to do things better. Any suggestion/book/article/course/video will help me immensely in writing better test cases.

Thanks for reading.
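One possible shape for the rabbitmq use-case above — a sketch only, using standard pika calls (pika 1.x names); the exchange, queue, and routing-key names are invented for illustration:

    import unittest

    import pika


    class RabbitEndToEndTest(unittest.TestCase):
        # names below are illustrative, not from the original post
        EXCHANGE = "test_exchange"
        QUEUE = "test_queue"

        def setUp(self):
            # build a fresh test environment before each test
            self.connection = pika.BlockingConnection(
                pika.ConnectionParameters("localhost")
            )
            self.channel = self.connection.channel()
            self.channel.exchange_declare(
                exchange=self.EXCHANGE, exchange_type="direct"
            )
            self.channel.queue_declare(queue=self.QUEUE)
            self.channel.queue_bind(
                queue=self.QUEUE, exchange=self.EXCHANGE, routing_key="work"
            )

        def tearDown(self):
            # delete the test environment so each test starts clean
            self.channel.queue_delete(queue=self.QUEUE)
            self.channel.exchange_delete(exchange=self.EXCHANGE)
            self.connection.close()

        def test_message_round_trip(self):
            self.channel.basic_publish(
                exchange=self.EXCHANGE, routing_key="work", body=b"hello"
            )
            method, properties, body = self.channel.basic_get(
                queue=self.QUEUE, auto_ack=True
            )
            self.assertEqual(body, b"hello")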
Re: Express thanks
hi,

liking py, i follow py discussions at quite some places, and i can say that up to now, py mailing lists are awesome. just make a drop on irc ... Keep it up guys!

Abdur-Rahmaan Janhangeer,
Mauritius
abdurrahmaanjanhangeer.wordpress.com

On 21 Aug 2017 18:38, "Hamish MacDonald" wrote:

> I wanted to give a shout out to the wonderfully passionate contributions
> to python I've witnessed following this and other mailing lists over the
> last little bit. The level of knowledge and willingness to help I've seen
> are truly inspiring. Super motivating.
>
> Probably the wrong forum for such a message but what the hey.
>
> Hamish
Re: BeautifulSoup doesn't work with a threaded input queue?
On 2017-08-27 21:35, Christopher Reimer via Python-list wrote:

> With 20 threads requesting 20 different pages, the pages aren't going
> into the queue in sequential order (i.e., 0, 1, 2, ..., 17, 18, 19);
> they arrive at different times for the parser worker threads to get for
> processing.
>
> It's a similar situation with a list, but I sort the list before giving
> it to the parser, so all the items are in sequential order and fed to
> the parser one at a time.

What if you don't sort the list?

I ask because it sounds like you're changing 2 variables (i.e. list->queue, sorted->unsorted) at the same time, so you can't be sure that it's the queue that's the problem.
Re: BeautifulSoup doesn't work with a threaded input queue?
On 8/27/2017 1:31 PM, Peter Otten wrote:

> Here's a simple example that extracts titles from generated html. It
> seems to work. Does it resemble what you do?

Your example is similar to my code when I'm using a list for the input to the parser. You have soup_threads and write_threads, but no read_threads.

The particular website I'm scraping requires checking each page for the sentinel value (i.e., "Sorry, no more comments") in order to determine when to stop requesting pages. For my comment history, that's ~750 pages to parse for ~11,000 comments.

I have 20 read_threads requesting pages and putting them into the output queue that is the input_queue for the parser. My soup_threads can get items from the queue, but BeautifulSoup doesn't do anything after that.

Chris R.
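A sketch of what such a read thread might look like — the URL pattern, the page-numbering scheme, and the sentinel check are stand-ins, since the actual script isn't shown:

    import requests

    SENTINEL_TEXT = "Sorry, no more comments"  # the site's end-of-history marker

    def read_pages(start, step, pages_out):
        # each of the 20 read threads walks its own slice of page numbers
        # (thread k fetches pages k, k + 20, k + 40, ...)
        number = start
        while True:
            response = requests.get(
                "https://example.com/comments?page={}".format(number)
            )
            if SENTINEL_TEXT in response.text:
                break  # past the last page of comments
            pages_out.put(response.content)
            number += step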
Re: BeautifulSoup doesn't work with a threaded input queue?
On 8/27/2017 1:50 PM, MRAB wrote:

> What if you don't sort the list?
>
> I ask because it sounds like you're changing 2 variables (i.e.
> list->queue, sorted->unsorted) at the same time, so you can't be sure
> that it's the queue that's the problem.

If I'm using a list, I'm using a for loop to feed items into the parser. If I'm using a queue, I'm using worker threads to put or get items. The item is still the same whether it comes from a list or a queue.

Chris R.
Re: BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer writes:

> I have 20 read_threads requesting and putting pages into the output
> queue that is the input_queue for the parser.

Given how slow parsing is, you probably want to scrape the pages into disk files, and then run the parser in parallel processes that read from the disk. You could also use something like Redis (redis.io) as a queue.
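A sketch of that division of labour — fetch to disk first, then parse with a process pool; the file layout and the title extraction are placeholders:

    import glob
    from multiprocessing import Pool

    import bs4


    def parse_file(path):
        # each worker process parses one page previously saved to disk
        with open(path, encoding="utf-8") as f:
            soup = bs4.BeautifulSoup(f.read(), "lxml")
        title = soup.find("title")
        return path, title.text if title else ""


    if __name__ == "__main__":
        # assumes the requestor stage already wrote pages/page_*.html
        with Pool() as pool:
            results = pool.imap_unordered(
                parse_file, glob.glob("pages/page_*.html")
            )
            for path, title in results:
                print(path, title)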
Re: BeautifulSoup doesn't work with a threaded input queue?
Christopher Reimer via Python-list wrote:

> The particular website I'm scraping requires checking each page for the
> sentinel value (i.e., "Sorry, no more comments") in order to determine
> when to stop requesting pages.

Where's that check happening? If it's in the soup thread you need some kind of back channel to the read threads to inform them that you need no more pages.

> For my comment history, that's ~750 pages to parse for ~11,000 comments.
>
> I have 20 read_threads requesting pages and putting them into the output
> queue that is the input_queue for the parser. My soup_threads can get
> items from the queue, but BeautifulSoup doesn't do anything after that.
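One way to wire such a back channel is a threading.Event that the soup threads set and the read threads poll between requests — a sketch only, with the parsing reduced to a placeholder:

    import threading

    import bs4
    import requests

    no_more_pages = threading.Event()  # back channel: parser -> readers

    def read_worker(urls_in, pages_out):
        # readers check the flag before each new request
        while not no_more_pages.is_set():
            url = urls_in.get()
            pages_out.put(requests.get(url).content)

    def soup_worker(pages_in, comments_out):
        while True:
            soup = bs4.BeautifulSoup(pages_in.get(), "lxml")
            if "Sorry, no more comments" in soup.get_text():
                no_more_pages.set()  # readers stop requesting further pages
                continue
            comments_out.put(soup.get_text())  # real comment parsing goes here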
Re: BeautifulSoup doesn't work with a threaded input queue?
Ah, shoot me. I had a .join() statement on the output queue but not on the input queue, so the threads for the input queue got terminated before BeautifulSoup could get started. I went down that same rabbit hole with the CSVWriter the other day. *sigh*

Thanks for everyone's help.

Chris R.
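For reference, the pattern involved — queue.join() blocks until every item taken with get() has been matched by a task_done() call; a minimal sketch with daemon workers (names are illustrative, not the actual script):

    import threading
    from queue import Queue

    input_queue = Queue()
    output_queue = Queue()

    def worker():
        while True:
            item = input_queue.get()
            output_queue.put(item.upper())  # stand-in for the real parsing
            input_queue.task_done()         # lets input_queue.join() make progress

    for _ in range(4):
        threading.Thread(target=worker, daemon=True).start()

    for word in ["alpha", "beta", "gamma"]:
        input_queue.put(word)

    # without this join, the main thread can exit (killing the daemon
    # workers) before any item is processed -- the symptom in this thread
    input_queue.join()
    print([output_queue.get() for _ in range(3)])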
Re: Need advice on writing better test cases.
Anubhav Yadav writes:

> I want to write more test cases, especially ones that rely on database
> insertions and reads and file IO.

Thanks for taking seriously the importance of test cases for your code!

One important thing to recognise is that a unit test is only one type of test. It tests one unit of code, typically a function, and should assert exactly one clearly true-or-false result of calling that code unit.

If you have a function and you want to assert *that function's* behaviour, you can avoid external dependencies during the test run by providing fake resources. These can be mocks (e.g. with ‘unittest.mock’) or other fake resources that are going to behave exactly how you want, for the purpose of testing the code unit.

Unit test cases:

* Exercise a small unit of code in isolation.
* Each tests exactly one obvious behaviour of the code unit.
* Aim to have exactly one reason the test case can fail.

Because they are isolated and test a small code unit, they are typically *fast* and can be run very often, because the entire unit test suite completes in seconds.

> How to test if things are going into the database properly or not?

That is *not* a unit test; it is a test that one part of your code has the right effect on some other part of the system. This meets the description not of a unit test but of an integration test.

These integration tests, because they will likely be a lot slower than your unit tests, should be in a separate suite of integration tests, to be run when the time is available to run them.

Integration tests:

* Exercise many code units together.
* Typically make an assertion about the *resulting state* of many underlying actions.
* Can have many things that can cause the test case to fail.

> (mysql/mongo). I want to be able to create a test database environment
> as simply as possible, and create and delete the test environment
> before each functional test case is run.

One good article discussing how to make integration tests, specifically for database integration with your app, is this one <https://julien.danjou.info/blog/2014/db-integration-testing-strategies-python>.

I hope that helps.

-- 
 \     “The enjoyment of one's tools is an essential ingredient of |
  `\    successful work.” —Donald Knuth, _The Art of Computer |
_o__)   Programming_ |

Ben Finney
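To make the fake-resources point concrete, a minimal sketch with unittest.mock — the save_user function and its db argument are invented for illustration:

    import unittest
    from unittest.mock import Mock


    def save_user(db, name):
        # code unit under test: inserts a row and reports success
        db.insert("users", {"name": name})
        return True


    class SaveUserTest(unittest.TestCase):
        def test_inserts_one_user_row(self):
            fake_db = Mock()  # stands in for the real database handle
            self.assertTrue(save_user(fake_db, "anubhav"))
            fake_db.insert.assert_called_once_with("users", {"name": "anubhav"})


    if __name__ == "__main__":
        unittest.main()

Because the Mock records every call made to it, the test asserts the function's one obvious behaviour (one insert, with the right arguments) without any database running at all.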