On 22 Jan, 06:31, Alnilam <[EMAIL PROTECTED]> wrote:
> Sorry for the noob question, but I've gone through the documentation
> on python.org, tried some of the diveintopython and boddie's examples,
> and looked through some of the numerous posts in this group on the
> subject and I'm still rather confused. I know that there are some
> great tools out there for doing this (BeautifulSoup, lxml, etc.) but I
> am trying to accomplish a simple task with a minimal (as in nil)
> amount of adding in modules that aren't "stock" 2.5, and writing a
> huge class of my own (or copying one from diveintopython) seems
> overkill for what I want to do.

It's unfortunate that you don't want to install extra modules, but I'd
probably use libxml2dom [1] for what you're about to describe...

> Here's what I want to accomplish... I want to open a page, identify a
> specific point in the page, and turn the information there into
> plaintext. For example, on the www.diveintopython.org page, I want to
> turn the paragraph that starts "Translations are freely
> permitted" (and ends ..."let me know"), into a string variable.
>
> Opening the file seems pretty straightforward.
>
> >>> import urllib
> >>> page = urllib.urlopen("http://diveintopython.org/")
> >>> source = page.read()
> >>> page.close()
>
> gets me to a string variable consisting of the un-parsed contents of
> the page.

Yes, there may be shortcuts that let some parsers read directly from
the server, but it's always good to have the page text around, anyway.

> Now things get confusing, though, since there appear to be several
> approaches.
> One that I read somewhere was:
>
> >>> from xml.dom.ext.reader import HtmlLib
> >>> reader = HtmlLib.Reader()
> >>> doc = reader.fromString(source)
>
> This gets me doc as <HTML Document at 9b4758>
>
> >>> paragraphs = doc.getElementsByTagName('p')
>
> gets me all of the paragraph children, and the one I specifically want
> can then be referenced with: paragraphs[5] This method seems to be
> pretty straightforward, but what do I do with it to get it into a
> string cleanly?

In less sophisticated DOM implementations, what you'd do is to loop
over the "descendant" nodes of the paragraph which are text nodes and
concatenate them.

> >>> from xml.dom.ext import PrettyPrint
> >>> PrettyPrint(paragraphs[5])
>
> shows me the text, but still in html, and I can't seem to get it to
> turn into a string variable, and I think the PrettyPrint function is
> unnecessary for what I want to do.

Yes, PrettyPrint is for prettyprinting XML. You just want to visit and
collect the text nodes.

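For illustration, here is a minimal sketch of that visit-and-collect
approach using only the standard library. It is an assumption-laden
example rather than anything taken from xml.dom.ext: it runs
xml.dom.minidom over a small, well-formed fragment invented for the
purpose (minidom will not cope with typical real-world HTML), and the
collect_text name is purely hypothetical. The function relies only on
childNodes, nodeType and nodeValue, so it would presumably also work
on a node such as paragraphs[5], but that is not verified here.

```python
from xml.dom import minidom, Node

def collect_text(node, parts=None):
    """Recursively gather the values of all text nodes below `node`."""
    if parts is None:
        parts = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            # Text node: keep its value.
            parts.append(child.nodeValue)
        elif child.nodeType == Node.ELEMENT_NODE:
            # Element: descend into its children.
            collect_text(child, parts)
    return parts

# Hypothetical stand-in for the parsed page (well-formed on purpose).
doc = minidom.parseString(
    "<html><body><p>Translations are <b>freely</b> permitted.</p></body></html>"
)
paragraph = doc.getElementsByTagName("p")[0]

# Concatenate the collected pieces and trim surrounding whitespace.
text = "".join(collect_text(paragraph)).strip()
print(text)
```

Collecting the pieces in a list and joining them once at the end
avoids repeated string concatenation inside the recursion.
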
> Formatter seems to do what I want,
> but I can't figure out how to link the "Element Node" at
> paragraphs[5] with the formatter functions to produce the string I
> want as output. I tried some of the htmllib.HTMLParser(formatter
> stuff) examples, but while I can supposedly get that to work with
> formatter a little easier, I can't figure out how to get HTMLParser to
> drill down specifically to the 6th paragraph's contents.

Given that you've found the paragraph above, you just need to write a
recursive function which visits child nodes, and if it finds a text
node then it collects the value of the node in a list; otherwise, for
elements, it visits the child nodes of that element; and so on. The
recursive approach is presumably what the formatter uses, but I can't
say that I've really looked at it.

Meanwhile, with libxml2dom, you'd do something like this:

import libxml2dom
d = libxml2dom.parseURI("http://www.diveintopython.org/", html=1)
saved = None

# Find the paragraphs.
for p in d.xpath("//p"):

    # Get the text without leading and trailing space.
    text = p.textContent.strip()

    # Save the appropriate paragraph text.
    if text.startswith("Translations are freely permitted") and \
        text.endswith("just let me know."):

        saved = text
        break

The magic part of this code which saves you from needing to write that
recursive function mentioned above is the textContent property on the
paragraph element.

Paul

[1] http://www.python.org/pypi/libxml2dom