Sorry for the noob question. I've gone through the documentation on python.org, tried some of the diveintopython and Boddie's examples, and looked through some of the numerous posts in this group on the subject, and I'm still rather confused. I know that there are some great tools out there for doing this (BeautifulSoup, lxml, etc.), but I am trying to accomplish a simple task while adding a minimal (as in nil) number of modules that aren't "stock" 2.5. Writing a huge class of my own (or copying one from diveintopython) seems overkill for what I want to do.
Here's what I want to accomplish... I want to open a page, identify a specific point in the page, and turn the information there into plaintext. For example, on the www.diveintopython.org page, I want to turn the paragraph that starts "Translations are freely permitted" (and ends ..."let me know"), into a string variable.

Opening the file seems pretty straightforward.

>>> import urllib
>>> page = urllib.urlopen("http://diveintopython.org/")
>>> source = page.read()
>>> page.close()

gets me to a string variable consisting of the un-parsed contents of the page.

Now things get confusing, though, since there appear to be several approaches. One that I read somewhere was:

>>> from xml.dom.ext.reader import HtmlLib
>>> reader = HtmlLib.Reader()
>>> doc = reader.fromString(source)

This gets me doc as <HTML Document at 9b4758>

>>> paragraphs = doc.getElementsByTagName('p')

gets me all of the paragraph children, and the one I specifically want can then be referenced with:

paragraphs[5]

This method seems to be pretty straightforward, but what do I do with it to get it into a string cleanly?

>>> from xml.dom.ext import PrettyPrint
>>> PrettyPrint(paragraphs[5])

shows me the text, but still in html, and I can't seem to get it to turn into a string variable, and I think the PrettyPrint function is unnecessary for what I want to do.

Formatter seems to do what I want, but I can't figure out how to link the "Element Node" at paragraphs[5] with the formatter functions to produce the string I want as output. I tried some of the htmllib.HTMLParser(formatter stuff) examples, but while I can supposedly get that to work with formatter a little easier, I can't figure out how to get HTMLParser to drill down specifically to the 6th paragraph's contents.

Thanks in advance.

- A.
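
For reference, here is a rough sketch of one standard-library route to the goal described above. It is offered as an illustration under stated assumptions, not as the definitive way to do it. It swaps the DOM approach for a small subclass of the stock HTMLParser class (module `HTMLParser` on Python 2.x, `html.parser` on Python 3) that gathers the text of each <p> element into a list of strings. The names `ParagraphText` and `paragraph_texts` and the sample HTML are made up for the example. On 2.x, entity and character references arrive through separate handler methods that this sketch ignores, so text like `&amp;` would be dropped there.

```python
try:
    from html.parser import HTMLParser      # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser       # Python 2.x location


class ParagraphText(HTMLParser):
    """Collect the plain text of every <p> element (hypothetical helper)."""

    def __init__(self):
        # Called directly rather than via super(): the 2.x base class is old-style.
        HTMLParser.__init__(self)
        self.paragraphs = []    # finished paragraphs, one string each
        self._chunks = None     # text pieces of the open paragraph, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._finish()      # an unclosed <p> ends where the next one starts
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish()

    def handle_data(self, data):
        # Text inside nested tags (<a>, <b>, ...) still lands here.
        if self._chunks is not None:
            self._chunks.append(data)

    def _finish(self):
        if self._chunks is not None:
            # Join the pieces and collapse runs of whitespace to single spaces.
            self.paragraphs.append(" ".join("".join(self._chunks).split()))
            self._chunks = None


def paragraph_texts(source):
    """Return a list with the plain text of each paragraph in `source`."""
    parser = ParagraphText()
    parser.feed(source)
    parser.close()
    parser._finish()            # flush a trailing paragraph that was never closed
    return parser.paragraphs


# Made-up sample standing in for the downloaded page source.
sample = ("<html><body><p>First <b>bold</b> one.</p>"
          "<p>Translations are <a href='#'>freely permitted</a>;\n"
          "  let me know.</p></body></html>")

texts = paragraph_texts(sample)
# Selecting by opening words is less brittle than a fixed index like [5].
wanted = [t for t in texts if t.startswith("Translations are freely permitted")]
print(wanted[0])
```

Selecting the paragraph by its opening words rather than by position means the lookup keeps working if paragraphs are added earlier in the page. With the real page, the `source` string read from the URL would be passed in place of `sample`.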