A combination of urllib, urllib2 and BeautifulSoup should do it: urllib/urllib2 to download the pages, BeautifulSoup to parse them. Read BeautifulSoup's documentation to learn how to navigate the DOM.
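Something along these lines might be a starting point for the "one rule per site" idea. It is only a rough sketch under a few assumptions: Python 3 (where urllib2 has been folded into urllib.request), the third-party beautifulsoup4 package, and made-up site names and CSS selectors (SITE_RULES, fetch and extract_headlines are hypothetical names, not part of any library). The real rules would have to be written by looking at each target site's markup.

```python
# Sketch only: assumes Python 3 and the third-party beautifulsoup4 package.
# Site names and selectors below are invented for illustration.
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Hypothetical per-site rules: one CSS selector for headline links per site.
SITE_RULES = {
    "example-news": "div.story h2 a",
    "example-blog": "li.post > a.title",
}


def fetch(url, timeout=10):
    """Download a page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_headlines(html, site):
    """Apply the rule registered for `site`; return (title, link) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for a in soup.select(SITE_RULES[site]):
        results.append((a.get_text(strip=True), a.get("href")))
    return results


# Inline sample so the parsing step can be tried without network access.
SAMPLE = """
<html><body>
  <div class="story"><h2><a href="/n/1">First story</a></h2></div>
  <div class="story"><h2><a href="/n/2"> Second story </a></h2></div>
  <div class="ad"><h2><a href="/buy">Buy now</a></h2></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_headlines(SAMPLE, "example-news"))
```

For a live page you would pass the result of fetch(url) instead of SAMPLE. Keeping the rules in a plain dict means that adding a new site is mostly a matter of adding one entry, and the crawler loop itself can run unattended.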
[EMAIL PROTECTED] wrote:
> Hi All,
>
> I am involved in a project which aims to collect news information
> published on selected, known web sites in the format of HTML, RSS,
> etc., shortlist the items, and create a bookmark on our website for
> the news content (we will use django for web development). Currently
> this project is under heavy development.
>
> I need help with an HTML parser.
>
> I can download the web pages from the target sites. Then I have to
> start parsing. Since they are all HTML web pages, they will have
> different styles and tags, so it is very hard for me to parse the
> data. So what we plan is to have one or more rules for each website
> and run based on those rules. We can even write a small amount of code
> for each web site if required. But the Crawler, Parser and Indexer
> need to run unattended. I don't know how to proceed next.
>
> I saw a couple of python parsers like pyparsing, yappy, yapps, etc.
> but they haven't given any example for HTML parsing. Someone
> recommended using "lynx" to convert the page into text and parse the
> data. That also looks good, but I still end up writing a huge chunk of
> code for each web page.
>
> What we need is:
>
> One nice parser which works on an HTML/text file (lynx output),
> works based on certain rules, and returns us a result (do I need
> magic to do this :-( )
>
> Sorry about my English.
>
> Thanks & Regards,
>
> Krish
-- http://mail.python.org/mailman/listinfo/python-list