On Jan 22, 4:31 pm, Alnilam <[EMAIL PROTECTED]> wrote:
> Sorry for the noob question, but I've gone through the documentation
> on python.org, tried some of the diveintopython and boddie's examples,
> and looked through some of the numerous posts in this group on the
> subject and I'm still rather confused. I know that there are some
> great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
> am trying to accomplish a simple task with a minimal (as in nil)
> amount of adding in modules that aren't "stock" 2.5, and writing a
> huge class of my own (or copying one from diveintopython) seems
> overkill for what I want to do.
>
> Here's what I want to accomplish... I want to open a page, identify a
> specific point in the page, and turn the information there into
> plaintext. For example, on the www.diveintopython.org page, I want to
> turn the paragraph that starts "Translations are freely
> permitted" (and ends ..."let me know"), into a string variable.
>
> Opening the file seems pretty straightforward.
>
> >>> import urllib
> >>> page = urllib.urlopen("http://diveintopython.org/")
> >>> source = page.read()
> >>> page.close()
>
> gets me to a string variable consisting of the un-parsed contents of
> the page.
> Now things get confusing, though, since there appear to be several
> approaches.
> One that I read somewhere was:
>
> >>> from xml.dom.ext.reader import HtmlLib
Pardon me, but the standard-issue Python 2.n (for n in range(5, 2, -1))
doesn't have an xml.dom.ext ... you must have the mega-monstrous
200-module PyXML package installed. And you don't want the 75 KB
BeautifulSoup?
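
For what it's worth, if the goal really is "nothing beyond the stock install", the standard library's HTMLParser class can probably do the job without PyXML. What follows is only a rough sketch, not a tested recipe for the real page: the ParagraphGrabber class, the grab_paragraph helper and the sample markup are all made up for illustration, and it assumes the target text sits inside a single, properly closed <p> element.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.x


class ParagraphGrabber(HTMLParser):
    """Hypothetical helper: keep the first <p> whose text starts with prefix."""

    def __init__(self, prefix):
        HTMLParser.__init__(self)
        self.prefix = prefix
        self.in_p = False      # True while we are inside a <p> element
        self.chunks = []       # text fragments of the current paragraph
        self.result = None     # the matching paragraph, once found

    def handle_starttag(self, tag, attrs):
        # Start collecting at each <p>, unless a match was already found.
        if tag == "p" and self.result is None:
            self.in_p = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self.chunks).split())
            if text.startswith(self.prefix):
                self.result = text

    def handle_data(self, data):
        # Text inside nested tags (e.g. <a>) arrives here as well.
        if self.in_p:
            self.chunks.append(data)


def grab_paragraph(source, prefix):
    """Return the first matching paragraph as plain text, or None."""
    parser = ParagraphGrabber(prefix)
    parser.feed(source)
    parser.close()
    return parser.result


# Made-up stand-in for the page source; the real page will differ.
sample = """
<html><body>
<p>Some other paragraph.</p>
<p>Translations are freely permitted as long as they are
released under the <a href="#">same license</a>. If you translate it,
let me know.</p>
</body></html>
"""

text = grab_paragraph(sample, "Translations are freely permitted")
print(text)
```

With the real page you would pass in the string read from urlopen() instead of the sample. Note that under Python 2 this sketch silently drops entity references such as &amp;amp;, since it does not override handle_entityref, and real-world markup with unclosed <p> tags is where BeautifulSoup starts to earn its keep.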