On Jan 22, 7:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
>
> > I was asking this community if there was a simple way to use only the
> > tools included with Python to parse a bit of html.
>
> If you *know* that your document is valid HTML, you can use the HTMLParser
> module in the standard Python library. Or even the parser in the htmllib
> module. But a lot of HTML pages out there are invalid, some are grossly
> invalid, and those parsers are just unable to handle them. This is why
> modules like BeautifulSoup exist: they contain a lot of heuristics and
> trial-and-error and personal experience from the developers, in order to
> guess more or less what the page author intended to write and make some
> sense of that "tag soup".
> A guesswork like that is not suitable for the std lib ("Errors should
> never pass silently" and "In the face of ambiguity, refuse the temptation
> to guess.") but makes a perfect 3rd party module.
>
> If you want to use regular expressions, and that works OK for the
> documents you are handling now, fine. But don't complain when your RE's
> match too much or too little or don't match at all because of unclosed
> tags, improperly nested tags, nonsense markup, or just a valid combination
> that you didn't take into account.
>
> --
> Gabriel Genellina

Thanks, Gabriel. That does make sense, both what the benefits of BeautifulSoup are and why it probably won't become part of the std lib anytime soon.

The pages I'm trying to write this code to run against aren't in the wild, though. They are static HTML files on my company's LAN, are very consistent in format, and are (I believe) valid HTML. They just have specific paragraphs of useful information, located in the same place in each file, that I want to 'harvest' and put to better use. I used diveintopython.org as an example only (and in part because it had good clean HTML formatting).

I am pretty sure that I could craft some regular expressions to do the work (which of course would not be the case if I were screen scraping web pages in the 'wild'), but I was trying to find a way to do that using one of those std libs you mentioned.

I'm not sure whether HTMLParser or htmllib would work better to achieve the same effect as the regex example I gave above, or how to get either of them to do that. I thought I'd come close, but as someone pointed out early on, I'd accidentally tapped into PyXML, which is installed where I was testing code but not necessarily where I need it.

It may turn out that the regex way works faster, but falling back on methods I'm comfortable with doesn't help expand my Python knowledge.

So if anyone can tell me how to get HTMLParser or htmllib to grab a specific paragraph, and then provide the text in that paragraph in a clean, markup-free format, I'd appreciate it.

--
http://mail.python.org/mailman/listinfo/python-list
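
For reference, here is a minimal sketch of the HTMLParser approach asked about above: grab one paragraph by position and return its text with the markup dropped. It is written for Python 3, where the class lives in html.parser (in Python 2 the module was named HTMLParser, and the initialization would need adjusting). The names ParagraphGrabber and grab_paragraph and the sample HTML are made up for illustration, and it assumes the paragraph of interest can be identified by counting <p> tags, which may not match the real files.

```python
from html.parser import HTMLParser


class ParagraphGrabber(HTMLParser):
    """Collect the text of the n-th <p> element (0-based), dropping markup."""

    def __init__(self, index):
        super().__init__(convert_charrefs=True)
        self.index = index    # which paragraph to harvest (hypothetical choice)
        self.seen = -1        # index of the most recent <p> start tag
        self.inside = False   # True while inside the wanted paragraph
        self.chunks = []      # text fragments found inside that paragraph

    def handle_starttag(self, tag, attrs):
        # A new <p> also ends the previous one, so an omitted </p> is fine.
        if tag == "p":
            self.seen += 1
            self.inside = (self.seen == self.index)

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside = False

    def handle_data(self, data):
        # Inline tags such as <b> or <a> are skipped; only their text is kept.
        if self.inside:
            self.chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace and newlines.
        return " ".join("".join(self.chunks).split())


def grab_paragraph(html, index):
    """Return the plain text of paragraph number `index`, or '' if absent."""
    parser = ParagraphGrabber(index)
    parser.feed(html)
    parser.close()
    return parser.text()


sample = """<html><body>
<h1>Title</h1>
<p>First paragraph.</p>
<p>Second <b>paragraph</b> with a <a href="x.html">link</a>
and   extra whitespace.</p>
</body></html>"""

print(grab_paragraph(sample, 1))
# Second paragraph with a link and extra whitespace.
```

If the wanted paragraph is better identified by an attribute (say an id or class) than by position, the same structure should work with the test in handle_starttag changed to inspect attrs, which arrives as a list of (name, value) pairs.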