Hi All,

I am involved in a project that collects news published on selected, known web sites (as HTML, RSS, etc.), shortlists the items, and creates bookmarks to the news content on our website (we will use Django for the web development). The project is currently under heavy development.
I need help with HTML parsing. I can download the web pages from the target sites; the problem is the parsing step. Because they are all HTML pages with different styles and tags, it is very hard to parse the data generically. So our plan is to have one or more rules for each website and run the parser based on those rules. We can even write a small amount of code per web site if required, but the crawler, parser and indexer need to run unattended. I don't know how to proceed next.

I looked at a couple of Python parsers such as pyparsing, yappy and yapps, but none of them gives an example of HTML parsing. Someone recommended using "lynx" to convert each page to plain text and parsing that instead. That also looks reasonable, but I would still end up writing a huge chunk of code for each web page.

What we need is one nice parser that works on an HTML or text file (the lynx output), applies the rules for a site, and returns a result (do I need magic to do this? :-( ). A rough sketch of the kind of rule I have in mind is below my signature.

Sorry about my English.

Thanks & Regards,
Krish
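P.S. To make the "rule per site" idea more concrete, here is a rough sketch of what I am imagining. I picked BeautifulSoup only as an example parser, and the site name, selectors and function names are placeholders I made up for illustration; the real rules would be filled in for each site.

    # Sketch only: one rule per site, applied by a generic parse function.
    from bs4 import BeautifulSoup

    # Each rule says where the news items live on that site's pages.
    SITE_RULES = {
        "example-news.com": {                 # made-up site name
            "item": "div.story",              # container for one news item
            "title": "h2 a",                  # element holding the headline text
            "link": "h2 a",                   # element whose href is the article URL
        },
    }

    def parse_page(html, rule):
        """Apply one site's rule to a downloaded page, return (title, url) pairs."""
        soup = BeautifulSoup(html, "html.parser")
        results = []
        for item in soup.select(rule["item"]):
            title_el = item.select_one(rule["title"])
            link_el = item.select_one(rule["link"])
            if title_el and link_el and link_el.get("href"):
                results.append((title_el.get_text(strip=True), link_el["href"]))
        return results

    if __name__ == "__main__":
        sample_html = """
        <div class="story"><h2><a href="/a1">First headline</a></h2></div>
        <div class="story"><h2><a href="/a2">Second headline</a></h2></div>
        """
        print(parse_page(sample_html, SITE_RULES["example-news.com"]))

The idea is that the unattended crawler downloads a page, looks up the rule for that site, and calls something like parse_page() to get (title, url) pairs that we can then index and bookmark.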