[fpc-pascal] Re: stripping HTML
http://www.festra.com/eng/snip12.htm

Simple googling gives a lot of results, try: html strip (pascal OR delphi)

___
fpc-pascal maillist - fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-pascal
Re: [fpc-pascal] Re: stripping HTML
On 4/17/2011 11:00 AM, leledumbo wrote:
> http://www.festra.com/eng/snip12.htm
> Simple googling gives a lot of results, try: html strip (pascal OR delphi)

Thank you for your reply. I feel I have to justify myself: I always do extensive web and list archive searches before posting to a list (hence the infrequency of my posts). I had actually found that snippet over a week ago but immediately discarded it since it is obviously a toy solution. I have a much better solution already using the PCRE library on a text stream, sometimes re-reading portions of the stream by way of backtracking. The problems with any approach like that (esp. 6-liners like the one linked in your post, but also more elaborate yet still makeshift regular expression magic) are:

1. They don't handle faulty HTML well enough.

2. They don't handle any multi-line constructs like comments or scripts. Depending on how naively you read the input (e.g., using TStringList.ReadFromFile), they even choke on simple tags with all sorts of line breaks in between, which are frequently found (and which are, to my knowledge, not even ill-formed). What do you do with this (for a start)?

''

3. They are potentially not the most efficient solution, which is an important factor if the stripping alone takes days.

As a clarification: I am mining several ~500GB results of Heritrix crawls containing all versions of XML, HTML, inline CSS, inline scripts, etc. They need to be accurately stripped of HTML/XML (accurately means without losing too much real text). The text/markup ratio has to be calculated and stored on a per-line basis since I'm applying a machine learning algorithm afterwards which uses those ratios as one factor to separate coherent text from boilerplate (menus, navigation, copyright, etc.).

I had anticipated a reply along the lines of "read the documents into a DOM object and extract the text from that".
That is also problematic since it is not fast enough given the size of the input (that is an assumption; I haven't benchmarked the FPC DOM implementation yet), and I don't see how I can calculate the text/markup ratio per line in a simple fashion when using a DOM implementation.

I am *not* trying to clean or format simple or limited HTML on a string basis. For stuff like that, I wouldn't have asked. I actually wouldn't use Pascal for such tasks but rather sed or, at most, a Perl script.

I would still highly appreciate further input.

Regards
Roland
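The approach under discussion (a single linear scan that carries its state across line breaks and accumulates per-line text/markup counts) can be sketched in Free Pascal. Everything below — `CountChunk`, the state names, the counting policy — is a hypothetical illustration for this thread, not Roland's actual PCRE-based code; a real implementation would also need entity decoding, `<style>` blocks, CDATA, and detection of closing tags split across chunks.

```pascal
program RatioSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TParseState = (psText, psTag, psComment, psScript);

{ Counts text vs. markup characters in one chunk of HTML.
  State survives between calls, so tags, comments and <script> blocks
  that span line breaks are handled. (A closing marker split across
  two chunks is not detected in this sketch.) }
procedure CountChunk(const S: string; var State: TParseState;
  var TextChars, MarkupChars: Int64);
var
  i: Integer;
begin
  for i := 1 to Length(S) do
    case State of
      psText:
        if S[i] = '<' then
        begin
          if Copy(S, i, 4) = '<!--' then
            State := psComment
          else if LowerCase(Copy(S, i, 7)) = '<script' then
            State := psScript
          else
            State := psTag;
          Inc(MarkupChars);
        end
        else
          Inc(TextChars);
      psTag:
        begin
          Inc(MarkupChars);
          if S[i] = '>' then
            State := psText;
        end;
      psComment:
        begin
          Inc(MarkupChars);
          if (S[i] = '>') and (i >= 3) and (Copy(S, i - 2, 3) = '-->') then
            State := psText;
        end;
      psScript:
        begin
          Inc(MarkupChars);
          if (S[i] = '>') and (i >= 9) and
             (LowerCase(Copy(S, i - 8, 9)) = '</script>') then
            State := psText;
        end;
    end;
end;

var
  St: TParseState;
  T, M: Int64;
begin
  St := psText;
  T := 0;
  M := 0;
  { a tag broken across a line boundary is still counted as markup,
    because the parse state survives between lines }
  CountChunk('<a href="x"', St, T, M);
  CountChunk('>link</a> text', St, T, M);
  WriteLn('text=', T, ' markup=', M);  { text=9 markup=16 }
end.
```

Feeding each input line through `CountChunk` with persistent state yields exactly the per-line counts needed for the ratio, without ever materializing a DOM.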
Re: [fpc-pascal] Re: stripping HTML
HTML is not meant to be handled on a line-by-line basis like other text-based formats. According to the specs, HTML is not line-based. Browsers should display the following two HTML snippets identically:

one#13#10two

and

one#13#10two

With HTML tags removed, both result in:

one two

As such, a line-based text/markup ratio does not make much sense IMHO, especially since browsers do strip line breaks in most text elements except within ... .

That said, I believe that DIHtmlParser should care for most of your needs:

http://yunqa.de/delphi/doku.php/products/htmlparser/index

DIHtmlParser meets most of your requirements:

* Not DOM-based, very fast.
* Hand-crafted, linear-scan Unicode HTML parser.
* Handles SCRIPTs and STYLEs well.
* Simple "Extract Text" demo included, may be modified as needed.

Drawbacks:

* Like HTML, DIHtmlParser is not line-based. An option is available to strip or preserve line breaks and white space.
* Pre-compiled units are available for Delphi only. The source code is required to compile with FreePascal.

Ralf

On 17.04.2011 14:08, Roland Schäfer wrote:
> I feel I have to justify myself: I always do extensive web and list
> archive searches before posting to a list (hence the infrequency of my
> posts). I had actually found that snippet over a week ago but
> immediately discarded it since it is obviously a toy solution. I have a
> much better solution already using the PCRE library on a text stream,
> sometimes re-reading portions of the stream by way of backtracking. The
> problems with any approach like that (esp. 6-liners like the one linked
> in your post, but also more elaborate yet still makeshift regular
> expression magic) are:
>
> 1. They don't handle faulty HTML well enough.
>
> 2. They don't handle any multi-line constructs like comments or scripts.
> [...]
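Ralf's point — that line breaks inside ordinary text elements do not survive rendering — can be illustrated with a small whitespace-normalization sketch. `CollapseWhitespace` is a hypothetical helper written for this thread, not part of DIHtmlParser:

```pascal
program CollapseDemo;
{$mode objfpc}{$H+}

{ Collapses runs of whitespace (spaces, tabs, CR, LF) into single spaces,
  roughly what browsers do when rendering ordinary text content
  (not inside preformatted elements). }
function CollapseWhitespace(const S: string): string;
var
  i: Integer;
  InWS: Boolean;
begin
  Result := '';
  InWS := False;
  for i := 1 to Length(S) do
    if S[i] in [' ', #9, #10, #13] then
      InWS := True
    else
    begin
      if InWS and (Result <> '') then
        Result := Result + ' ';
      InWS := False;
      Result := Result + S[i];
    end;
end;

begin
  { both snippets normalize to the same rendered text }
  WriteLn(CollapseWhitespace('one'#13#10'two'));  { one two }
  WriteLn(CollapseWhitespace('one two'));         { one two }
end.
```

This is why a per-line ratio is only a heuristic: once whitespace is normalized this way, the original line boundaries carry no meaning for the rendered text.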
Re: [fpc-pascal] Re: stripping HTML
Thanks a lot for your reply.

On 4/17/2011 3:46 PM, Ralf Junker wrote:
> HTML is not meant to be handled on a line-by-line basis like other
> text-based formats. According to the specs, HTML is not line-based.
> Browsers should display the following two HTML snippets identically:
> [...]
> As such, a line-based text/markup ratio does not make much sense IMHO,
> especially since browsers do strip line breaks in most text elements
> except within ... .

This is sort of off-topic, so I'll make it short: Yes, that is a problem we are aware of. However, experiments with even simple thresholds ("remove lines with less than 50% text") were sort of successful. Simple machine learning makes it much better. To avoid true paragraph detection (which would be desirable but costly given the TB-sized input), we are also experimenting with several line-based and non-line-based windows on the input and cumulative html/text ratios for those windows. Also, this is only stage one of the cleanup, and we run some more linguistically informed and costly steps on the already much smaller amounts of data. Maybe I'll give paragraph detection based on , etc. another try, but we actually decided against that a while ago because we lost huge amounts of valuable input due to non-use or very creative use of such elements in actual web pages.

> That said, I believe that DIHtmlParser should care for most of your needs:

Yes, that looks perfect. I wouldn't even have a problem with the license or with paying for it, and I even still have D7. However, my program has to run on our Debian 64-bit servers.

Regards
Roland
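The simple threshold mentioned above ("remove lines with less than 50% text") could look like the sketch below. `KeepLine` and its parameter names are made up for illustration; the real pipeline feeds the ratios into a machine learning stage instead of a fixed cutoff.

```pascal
program ThresholdSketch;
{$mode objfpc}{$H+}

{ Decides whether a line survives boilerplate filtering, given the
  text and markup character counts computed for that line.
  A line with no characters at all is dropped. }
function KeepLine(TextChars, MarkupChars: Int64;
  MinTextRatio: Double): Boolean;
var
  Total: Int64;
begin
  Total := TextChars + MarkupChars;
  if Total = 0 then
    Exit(False);
  Result := TextChars / Total >= MinTextRatio;
end;

begin
  { the 50% threshold from the discussion }
  WriteLn(KeepLine(60, 40, 0.5));  { TRUE  }
  WriteLn(KeepLine(10, 90, 0.5));  { FALSE }
end.
```

The windowed variant mentioned in the reply would sum `TextChars` and `MarkupChars` over a sliding range of lines before applying the same test, smoothing out single dense-markup lines inside otherwise coherent text.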
Re: [fpc-pascal] Re: stripping HTML
2011/4/17 Roland Schäfer:
> Yes, that looks perfect. I wouldn't even have a problem with the license
> or with paying for it, and I even still have D7. However, my program has
> to run on our Debian 64-bit servers.

You could contact the authors and say that you would like to buy a license if it works in FPC linux-x86-64.

--
Felipe Monteiro de Carvalho
Re: [fpc-pascal] Re: stripping HTML
On 17.04.2011 16:30, Felipe Monteiro de Carvalho wrote:
>> Yes, that looks perfect. I wouldn't even have a problem with the license
>> or with paying for it, and I even still have D7. However, my program has
>> to run on our Debian 64-bit servers.
>
> You could contact the authors and say that you would like to buy a
> license if it works in FPC linux-x86-64

I am the author of DIHtmlParser. I do not know if DIHtmlParser compiles and works in FPC linux-x86-64 because I do not have that environment available for testing. Unfortunately, low demand for that platform does not justify setting it up and supporting it on a regular basis.

I can say, however, that the latest version's source code is Pascal only and compiles on FPC Win32 without platform warnings. But I do suspect that it will need a few IFDEFs to make it Linux-compatible. If so, I would of course be glad to add them to the code so they will be available in future versions.

However, having read Stefan's more detailed, off-topic requirements description, I'd rather suggest that they come up with their own HTML parser and text filter. It sounds too specific to me to be handled by any standard component already available.

Ralf