Comment Holder <commenthol...@gmail.com> writes:

> Hi,
> I am totally new to Python. I noticed that there are many videos showing how
> to collect data from Python, but I am not sure if I would be able to
> accomplish my goal using Python so I can start learning.
>
> Here is the example of the target page:
> http://and.medianewsonline.com/hello.html
> In this example, there are 10 articles.
>
> What I exactly need is to do the following:
> 1- Collect the article title, date, source, and contents.
> 2- I need to be able to export the final results to excel or a database
> client. That is, I need to have all of those specified in step 1 in one row,
> while each of them saved in separate column. For example:
>
> Title1 Date1 Source1 Contents1
> Title2 Date2 Source2 Contents2
>
> I appreciate any advise regarding my case.
>
> Thanks & Regards//

Here is an attempt for you. It uses BeautifulSoup 4. It is written for
Python 3.3, so if you want to use Python 2.x you will have to make some
small changes, such as using "from urllib import urlopen" and probably
adjusting the print statements.

The formatting in columns is left as an exercise for you. I wonder how you
would want that with multi-paragraph contents.

from bs4 import BeautifulSoup
from urllib.request import urlopen

URL = "http://and.medianewsonline.com/hello.html"
html = urlopen(URL).read()
soup = BeautifulSoup(html)
arts = soup.find_all('div', class_='articleHeader')
for art in arts:
    name = art.contents[0].string.strip()
    print(name)
    artbody = art.find_next_sibling('div', class_='article enArticle')
    titlenode = artbody.find_next('div', id='hd')
    title = titlenode.get_text().strip()
    print("Title: {0}".format(title))
    srcnode = titlenode.find_next('a')
    while srcnode.parent.get('class') == ['author']:
        srcnode = srcnode.find_next('a')
    source = srcnode.string
    srcnode = srcnode.parent
    date = srcnode.find_previous_sibling('div').string
    print("Date: {0}".format(date))
    print("Source: {0}".format(source))
    cont = srcnode.find_next_siblings('p', class_='articleParagraph enarticleParagraph')
    contents = '\n'.join([c.get_text() for c in cont])
    print("Contents: {0}".format(contents))
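
If you get stuck on the column layout, one possible direction is the csv
module from the standard library, since Excel can open CSV files. The sketch
below is only an illustration under some assumptions: the helper name
write_rows, the file name articles.csv, and the sample rows are all made up
here, and in practice you would append (title, date, source, contents) to a
list inside the loop above instead of printing. A multi-paragraph contents
value stays in a single cell because the csv writer quotes fields that
contain newlines.

```python
import csv
import io

# Column headers matching the four fields asked for in the question.
FIELDS = ["Title", "Date", "Source", "Contents"]


def write_rows(rows, out):
    """Write a header line, then one article per row (hypothetical helper)."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for row in rows:
        writer.writerow(row)


# Made-up sample data standing in for the values collected by the scraper.
rows = [
    ("Title1", "Date1", "Source1", "First paragraph.\nSecond paragraph."),
    ("Title2", "Date2", "Source2", "Only paragraph."),
]

# Write to an in-memory buffer so the example has no side effects.
buf = io.StringIO()
write_rows(rows, buf)
csv_text = buf.getvalue()

# To write a real file in Python 3, open it with newline="":
# with open("articles.csv", "w", newline="", encoding="utf-8") as f:
#     write_rows(rows, f)
```

Reading the text back with csv.reader should give the same four columns per
row, with the line break inside Contents preserved.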

--
Piet van Oostrum <p...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
-- http://mail.python.org/mailman/listinfo/python-list