The following module was proposed for inclusion in the Module List:
  modid:       HTML::Dirty
  DSLIP:       bdpOp
  description: Parser for dirty, messed up HTML
  userid:      MIKO (Miko O'Sullivan)
  chapterid:   15 (World_Wide_Web_HTML_HTTP_CGI)
  communities:

  similar:
    HTML::Parser

  rationale:

    HTML::Dirty was created when I was attempting to parse some pages
    on the web and HTML::Parser couldn't handle the sloppy,
    syntactically messed up pages it was running into. When I found a
    page that displayed several hundred links in Netscape and IE, but
    HTML::Parser only found two of them, I decided to grow my own.

    The concept of parsing HTML that is known to be non-conforming is,
    admittedly, almost a contradiction: if it's non-conforming, how do
    you know how to parse it? There are two answers to this question.

    First, HTML::Dirty doesn't attempt to build a full element tree
    out of the tags. It just creates an array of tokens representing
    the text, tags, endtags, declarations, and comments. I've found
    that the array is quite sufficient for my HTML parsing needs.

    Second, HTML::Dirty was designed to attempt to parse HTML in the
    same way the popular browsers do. Right or wrong, the popular
    browsers set the de facto standard of how HTML is written, and if
    you're going to attempt to parse HTML from public web pages you'll
    have to deal with the mess that's out there.

  enteredby:   MIKO (Miko O'Sullivan)
  enteredon:   Thu Dec 13 00:52:34 2001 GMT

The resulting entry would be:

HTML::
::Dirty           bdpOp Parser for dirty, messed up HTML             MIKO

Thanks for registering,
The Pause Team

PS: The following links are only valid for module list maintainers:

Registration form with editing capabilities:
  https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=58200000_b3bc0601f6901f31&SUBMIT_pause99_add_mod_preview=1
Immediate (one click) registration:
  https://pause.perl.org/pause/authenquery?ACTION=add_mod&USERID=58200000_b3bc0601f6901f31&SUBMIT_pause99_add_mod_insertit=1
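
Illustration of the flat token array described in the rationale: the sketch below is in Python and is not HTML::Dirty's code or API. The names `TOKEN_RE` and `tokenize` are hypothetical, and the leniency rules (an unterminated tag or comment runs to the end of input, a `<` not followed by a letter is plain text) are assumptions chosen for the example rather than a description of what any browser or HTML::Dirty actually does.

```python
import re

# One alternative per token kind; input that matches none of them is text.
# Every alternative accepts end-of-input (\Z) in place of the closing
# delimiter, so truncated or sloppy markup still yields a token.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?(?:-->|\Z))"      # <!-- ... -->, closing --> optional
    r"|(?P<declaration><![^>]*(?:>|\Z))"   # <!DOCTYPE ...> and similar
    r"|(?P<endtag></[^>]*(?:>|\Z))"        # </a>, </A >, ...
    r"|(?P<tag><[A-Za-z][^>]*(?:>|\Z))",   # <a href=x>, even if never closed
    re.DOTALL,
)


def tokenize(html):
    """Return a flat list of (kind, raw_text) pairs; no element tree is built."""
    tokens = []
    pos = 0
    for match in TOKEN_RE.finditer(html):
        if match.start() > pos:
            # Whatever sits between two markup tokens is text.
            tokens.append(("text", html[pos:match.start()]))
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(html):
        tokens.append(("text", html[pos:]))
    return tokens


if __name__ == "__main__":
    sample = "<!DOCTYPE html><p>Hi <a href=x>link</A><!-- note --><b>unclosed"
    for kind, raw in tokenize(sample):
        print(kind, repr(raw))
```

Because the result is a plain list, a task such as collecting links reduces to scanning for `tag` tokens, with no need to recover a consistent nesting structure from the malformed input.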