To maintain paragraphs, replace any <p> or <br> tags with your operating system's line ending (CRLF on Windows) before stripping the remaining tags.
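Something along these lines might do as a starting point. It is only a rough sketch using the standard library's re and html modules, not the pyparsing htmlStripper example mentioned in the thread below. The function name and the two regular expressions are made up for illustration, and a regex approach like this assumes reasonably tidy markup (it will not drop the contents of script or style blocks, for instance).

```python
import html
import re

# Hypothetical helper for illustration; not part of pyparsing or htmlStripper.py.
# Matches opening/closing <p> tags and <br>, <br/>, <br /> in any letter case.
BREAK_TAG = re.compile(r"<\s*/?\s*(?:p|br)\b[^>]*>", re.IGNORECASE)
# Matches any other tag that is left over afterwards.
ANY_TAG = re.compile(r"<[^>]+>")


def strip_tags_keep_paragraphs(markup, newline="\n"):
    """Return the text of `markup` with <p>/<br> turned into line breaks."""
    # 1. Turn paragraph and line-break tags into newlines first,
    #    so the paragraph structure survives the next step.
    text = BREAK_TAG.sub(newline, markup)
    # 2. Remove every remaining tag.
    text = ANY_TAG.sub("", text)
    # 3. Decode entities such as &amp; into plain characters.
    text = html.unescape(text)
    # 4. Trim each line and drop the empty ones left behind by the tags.
    lines = [line.strip() for line in text.split(newline)]
    return newline.join(line for line in lines if line)


if __name__ == "__main__":
    sample = "<p>First <b>story</b>.</p><p>Second one.<br>Next line.</p>"
    print(strip_tags_keep_paragraphs(sample))
```

The default of "\n" is deliberate: when the result is written to a file opened in text mode, Python translates "\n" to the platform's own line ending (CRLF on Windows), so Notepad should display the paragraphs correctly. Pass newline="\r\n" only if you write the file in binary mode.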

On Jul 13, 8:57 am, kublai <[EMAIL PROTECTED]> wrote:
> On Jul 13, 5:44 pm, Paul McGuire <[EMAIL PROTECTED]> wrote:
>
> > On Jul 12, 4:42 am, kublai <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > For a project, I need to develop a corpus of online news stories. I'm
> > > looking for an application that, given the url of a web page, "copies"
> > > the rendered text of the web page (not the source HTML text), opens a
> > > text editor (Notepad), and displays the copied text for the user to
> > > examine and save into a text file. Graphics and sidebars to be
> > > ignored. The examples I have come across are much too complex for me
> > > to customize for this simple job. Can anyone lead me to the right
> > > direction?
> > >
> > > Thanks,
> > > gk
> >
> > One of the examples provided with pyparsing is an HTML stripper - view
> > it online at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.
> >
> > -- Paul
>
> Stripping tags is indeed one strategy that came to mind. I'm wondering
> how much information (for example, paragraphing) would be lost, and if
> what would be lost would be acceptable (to the project). I looked at
> pyparsing and I see that it's got a lot of text processing
> capabilities that I can use along the way. I sure will try it. Thanks
> for the post.
>
> Best,
> gk

--
http://mail.python.org/mailman/listinfo/python-list