HTML is not meant to be handled on a line-by-line basis the way other text-based formats are. According to the specs, HTML is not line-based. Browsers should display the following two HTML snippets identically:
  <p#13#10>one#13#10two</p>    and    <p>one#13#10two</p#13#10>

With the HTML tags removed, both result in:

  one two

As such, a line-based text/markup ratio does not make much sense IMHO, especially since browsers strip line breaks in most text elements, except within <pre> ... </pre>.

That said, I believe DIHtmlParser should cover most of your needs:

http://yunqa.de/delphi/doku.php/products/htmlparser/index

DIHtmlParser meets most of your requirements:

* Not DOM-based, very fast.
* Hand-crafted, linear-scan Unicode HTML parser.
* Handles SCRIPTs and STYLEs well.
* Simple "Extract Text" demo included, which may be modified as needed.

Drawbacks:

* Like HTML, DIHtmlParser is not line-based. An option is available to strip or preserve line breaks and white space.
* Pre-compiled units are available for Delphi only; the source code is required to compile with FreePascal.

(For illustration, a rough sketch of such a linear scanner is appended after the quoted message below.)

Ralf

On 17.04.2011 14:08, Roland Schäfer wrote:
> I feel I have to justify myself: I always do extensive web and list
> archive searches before posting to a list (hence the infrequency of my
> posts). I had actually found that snippet over a week ago but
> immediately discarded it since it is obviously a toy solution. I have a
> much better solution already using the PCRE library on a text stream,
> sometimes re-reading portions of the stream by way of backtracking. The
> problems with any approach like that (esp. 6-liners like the one linked
> in your post, but also more elaborate but still makeshift regular
> expression magic) are:
>
> 1. They don't handle faulty HTML well enough.
>
> 2. They don't handle any multi-line constructs like comments or scripts.
> Depending on how naively you read the input (e.g., using
> TStringList.ReadFromFile), they even choke on simple tags with all sorts
> of line breaks in between, which are frequently found (and which are, to
> my knowledge, not even ill-formed). What do you do with this (for a start)?
>
> '<div class="al#13#10#13ert">'
>
> 3. They are potentially not the most efficient solution, which is an
> important factor if the stripping alone takes days.
>
> As a clarification: I am mining several ~500GB results of Heritrix
> crawls containing all versions of XML, HTML, inline CSS, inline scripts,
> etc. They need to be accurately stripped from the HTML/XML (accurately
> meaning without losing too much real text). The text/markup ratio has to
> be calculated and stored on a per-line basis since I'm applying a machine
> learning algorithm afterwards which uses those ratios as one factor to
> separate coherent text from boilerplate (menus, navigation, copyright, etc.).
>
> I had anticipated a reply along the lines of "read the documents into a
> DOM object and extract the text from that". That is also problematic,
> since it is not fast enough given the size of the input (that is an
> assumption; I haven't benchmarked the FPC DOM implementation yet), and
> I don't see how I can calculate the text/markup ratio per line in a
> simple fashion when using a DOM implementation.
>
> I am *not* trying to clean or format simple or limited HTML on a string
> basis. For stuff like that, I wouldn't have asked. I actually wouldn't
> use Pascal for such tasks but rather sed or, at most, a Perl script.
>
> I would still highly appreciate further input.
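
For illustration, here is a rough, untested sketch of such a hand-rolled linear scanner in FreePascal. It is not DIHtmlParser and does not use its API; all identifiers are made up for this sketch. The point is only that reading the input as a raw character stream makes tags spanning CR/LF (like the '<div class="al#13#10#13ert">' example above) unproblematic, that SCRIPT/STYLE content and comments are counted as markup, and that a text/markup figure is emitted per physical input line. Entities, '>' inside quoted attribute values, and recovery from faulty HTML are deliberately left out.

{ Untested illustrative sketch, not DIHtmlParser.  Reads the input as a raw
  character stream, so tags spanning CR/LF are handled like any other tag,
  and reports text vs. markup characters per physical input line. }
program LineRatioSketch;

{$mode objfpc}{$H+}

uses
  SysUtils, Classes;

type
  TScanState = (ssText, ssTag, ssComment, ssScript);

var
  Src: TFileStream;
  Buf: array[0..65535] of Char;
  BytesRead, I, LineNo, TextChars, MarkupChars: Integer;
  C, Prev1, Prev2: Char;
  State: TScanState;
  TagBuf: string;

{ Emit one record per physical input line:
  line number, text chars, markup chars, text density = text / (text + markup). }
procedure FlushLine;
var
  Density: Double;
begin
  if TextChars + MarkupChars > 0 then
    Density := TextChars / (TextChars + MarkupChars)
  else
    Density := 0.0;
  WriteLn(Format('%d'#9'%d'#9'%d'#9'%.3f', [LineNo, TextChars, MarkupChars, Density]));
  Inc(LineNo);
  TextChars := 0;
  MarkupChars := 0;
end;

begin
  State := ssText; LineNo := 1; TextChars := 0; MarkupChars := 0;
  TagBuf := ''; Prev1 := #0; Prev2 := #0;
  Src := TFileStream.Create(ParamStr(1), fmOpenRead or fmShareDenyWrite);
  try
    repeat
      BytesRead := Src.Read(Buf, SizeOf(Buf));
      for I := 0 to BytesRead - 1 do
      begin
        C := Buf[I];
        { Line bookkeeping is independent of the parser state, so a tag or
          comment spanning several lines is charged to the lines it occupies. }
        if C = #10 then begin FlushLine; Continue; end;
        if C = #13 then Continue;
        case State of
          ssText:
            if C = '<' then
            begin
              State := ssTag; TagBuf := '<'; Inc(MarkupChars);
            end
            else
              Inc(TextChars);
          ssTag:
            begin
              Inc(MarkupChars);
              { Keep only the tag start; enough to recognise <!--, <script, <style. }
              if Length(TagBuf) < 16 then TagBuf := TagBuf + C;
              if LowerCase(TagBuf) = '<!--' then
                State := ssComment
              else if C = '>' then
              begin
                if (Pos('<script', LowerCase(TagBuf)) = 1) or
                   (Pos('<style', LowerCase(TagBuf)) = 1) then
                  State := ssScript
                else
                  State := ssText;
              end;
            end;
          ssComment:
            begin
              Inc(MarkupChars);
              if (C = '>') and (Prev1 = '-') and (Prev2 = '-') then
                State := ssText;
            end;
          ssScript:
            begin
              Inc(MarkupChars);
              { '</' ends SCRIPT/STYLE content; the closing tag is consumed in ssTag. }
              if (C = '/') and (Prev1 = '<') then
              begin
                State := ssTag; TagBuf := '</';
              end;
            end;
        end;
        Prev2 := Prev1; Prev1 := C;
      end;
    until BytesRead = 0;
    { Flush the last line if the file does not end in a newline. }
    if (TextChars > 0) or (MarkupChars > 0) then FlushLine;
  finally
    Src.Free;
  end;
end.

Compile it with fpc and pass the file to scan as the first parameter; the output is one tab-separated record per physical input line, which should be directly usable as the per-line text/markup factor for a machine learning step.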