[fpc-pascal] Re: stripping HTML

2011-04-17 Thread leledumbo
http://www.festra.com/eng/snip12.htm
Simple googling gives a lot of results, try: html strip (pascal OR delphi)

--
View this message in context: 
http://free-pascal-general.1045716.n5.nabble.com/stripping-HTML-tp4307374p4308621.html
___
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-pascal


Re: [fpc-pascal] Re: stripping HTML

2011-04-17 Thread Roland Schäfer
On 4/17/2011 11:00 AM, leledumbo wrote:
> http://www.festra.com/eng/snip12.htm
> Simple googling gives a lot of results, try: html strip (pascal OR delphi)

Thank you for your reply.

I feel I have to justify myself: I always do extensive web and list
archive searches before posting to a list (hence the infrequency of my
posts). I had actually found that snippet over a week ago but
immediately discarded it since it is obviously a toy solution. I have a
much better solution already using the PCRE library on a text stream,
sometimes re-reading portions of the stream by way of backtracking. The
problems with any approach like that (esp. 6-liners like the one linked
in your post, but also more elaborate but still makeshift regular
expression magic) are:

1. They don't handle faulty HTML well enough.

2. They don't handle any multi-line constructs like comments or scripts.
Depending on how naively you read the input (e.g., using
TStringList.LoadFromFile), they even choke on simple tags with all sorts
of line breaks in between, which are frequently found (and which are, to
my knowledge, not even ill-formed). What do you do with this (for a start)?

''

3. They are potentially not the most efficient solution, which is an
important factor if the stripping alone takes days.

As a clarification: I am mining several ~500GB results of Heritrix
crawls containing all versions of XML, HTML, inline CSS, inline scripts,
etc. They need to be accurately stripped from HTML/XML (accurately means
without losing too much real text). The text/markup ratio has to be
calculated and stored on a per-line basis since I'm applying a machine
learning algorithm afterwards which uses those ratios as one factor to
separate coherent text from boilerplate (menus, navigation, copyright etc.).
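
A minimal sketch of how such a per-line text/markup count could be
obtained in one linear pass. It is not the poster's PCRE solution, and
it is written in Python only for brevity; the logic is a small state
machine that would port to Free Pascal. The names per_line_counts and
text_ratio are hypothetical. Assumptions: only non-whitespace characters
are counted, script and style bodies count as markup, and a ">" inside a
quoted attribute value ends the tag (a known simplification).

```python
RAW_TEXT_ELEMENTS = {"script", "style"}  # bodies are treated as markup


def per_line_counts(html):
    """Return one [text_chars, markup_chars] pair per input line.

    The parser state survives line breaks, so tags, comments and
    script/style bodies that span several lines are still counted
    as markup on every line they touch.
    """
    counts = [[0, 0]]
    state = "text"   # one of: "text", "tag", "comment", "raw"
    tag_start = 0    # index of the "<" that opened the current tag
    raw_end = ""     # e.g. "</script" while inside a script body

    for i, ch in enumerate(html):
        if ch == "\n":
            counts.append([0, 0])  # start counting the next line
            continue
        if ch.isspace():
            continue               # whitespace is neither text nor markup
        line = counts[-1]

        if state == "text":
            if html.startswith("<!--", i):
                state = "comment"
                line[1] += 1
            elif (ch == "<" and i + 1 < len(html)
                  and (html[i + 1].isalpha() or html[i + 1] in "/!?")):
                state = "tag"
                tag_start = i
                line[1] += 1
            else:
                line[0] += 1       # includes a stray "<" as in "a < b"
        elif state == "tag":
            line[1] += 1
            if ch == ">":
                words = html[tag_start + 1:i].lower().split()
                name = words[0] if words else ""
                if name in RAW_TEXT_ELEMENTS:
                    state = "raw"
                    raw_end = "</" + name
                else:
                    state = "text"
        elif state == "comment":
            line[1] += 1
            if ch == ">" and html[i - 2:i] == "--":
                state = "text"
        else:  # state == "raw": inside a script or style body
            line[1] += 1
            if html[i:i + len(raw_end)].lower() == raw_end:
                state = "tag"      # the closing tag is parsed as a tag
                tag_start = i


    return counts


def text_ratio(text_chars, markup_chars):
    """Share of text among all counted characters; 0.0 for empty lines."""
    total = text_chars + markup_chars
    return text_chars / total if total else 0.0
```

Calling per_line_counts on a whole document and mapping text_ratio over
the result gives one ratio per line. Because the state is carried across
"\n", a tag broken over two lines contributes markup to both of them.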

I had anticipated a reply along the lines of "read the documents into a
DOM object and extract the text from that". That is also problematic
since it is probably not fast enough given the size of the input (that
is an assumption; I haven't benchmarked the FPC DOM implementation yet), and
I don't see how I can calculate the text/markup ratio per line in a
simple fashion when using a DOM implementation.

I am *not* trying to clean or format simple or limited HTML on a string
basis. For stuff like that, I wouldn't have asked. I actually wouldn't
use Pascal for such tasks but rather sed or a Perl script at most.

I would still highly appreciate further input.
Regards
Roland


Re: [fpc-pascal] Re: stripping HTML

2011-04-17 Thread Ralf Junker
HTML is not meant to be handled on a line-by-line basis the way other
text-based formats are. According to the specs, HTML is not line-based.
Browsers should display the following two HTML snippets identically:

  one#13#10two

and

  one#13#10two

With HTML tags removed, both result in:

  one two

As such, a line-based text/markup ratio does not make much sense IMHO,
especially since browsers do strip line breaks in most text elements
except within <pre> ... </pre>.
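
To illustrate the whitespace point, here is a small sketch in Python
(chosen only for brevity) of the collapsing that browsers apply to
ordinary text: any run of spaces, tabs, and line breaks is rendered as
one space. The function name collapse_whitespace is hypothetical, and
the sketch assumes the tags have already been removed and that the text
is not inside a preformatted element.

```python
import re

# Characters that HTML treats as ASCII whitespace.
_WHITESPACE_RUN = re.compile(r"[ \t\n\f\r]+")


def collapse_whitespace(text):
    """Collapse every whitespace run to one space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()
```

With this, "one" followed by #13#10 (CR LF) and "two" collapses to
"one two", the same result as for text that never contained a line
break.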

That said, I believe that DIHtmlParser should take care of most of your needs:

  http://yunqa.de/delphi/doku.php/products/htmlparser/index

DIHtmlParser meets most of your requirements:

  * Not DOM based, very fast.

  * Hand-crafted, linear-scan Unicode HTML parser.

  * Handles SCRIPTs and STYLEs well.

  * Simple "Extract Text" demo included, may be modified as needed.

Drawbacks:

  * Like HTML, DIHtmlParser is not line-based. An option is available
to strip or preserve line breaks and white space.

  * Pre-compiled units available for Delphi only. The source code is
required to compile with FreePascal.

Ralf



Re: [fpc-pascal] Re: stripping HTML

2011-04-17 Thread Roland Schäfer
Thanks a lot for your reply.

On 4/17/2011 3:46 PM, Ralf Junker wrote:
> HTML is not meant to be handled on a line-by-line basis as other
> text-based formats. According to the specs, HTML is not line-based.
> Browsers should display the following two HTML snippets identically:
 [...]
> As such, a line-based text/markup ratio does not make much sense IMHO,
> especially since browsers do strip line breaks in most text elements
> except within <pre> ... </pre>.

This is sort of off-topic, so I'll make it short: Yes, that is a problem
we are aware of. However, experiments with even simple thresholds
("remove lines with less than 50% text") were sort of successful. Simple
machine learning makes it much better. To avoid true paragraph detection
(which would be desirable but costly given the TB-sized input) we are
also experimenting with several line-based and non-line-based windows on
the input and cumulative html/text ratios for those windows. Also, this
is only stage one of the cleanup, and we run some more linguistically
informed and costly steps on the already much smaller amounts of data.
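
A minimal sketch of the two ideas mentioned above, a fixed per-line
threshold and a cumulative ratio over a window of lines. It is written
in Python for brevity and is not the code used in the project; the
function names, the window shape (centred, given as a radius in lines),
and the treatment of empty lines (ratio 0.0, hence dropped) are all
assumptions. The input is a list of (text_chars, markup_chars) pairs,
one per line.

```python
def text_share(text_chars, markup_chars):
    """Share of text among all counted characters; 0.0 if nothing counted."""
    total = text_chars + markup_chars
    return text_chars / total if total else 0.0


def keep_by_threshold(counts, min_share=0.5):
    """Indices of lines whose own text share is at least min_share."""
    return [i for i, (text, markup) in enumerate(counts)
            if text_share(text, markup) >= min_share]


def windowed_shares(counts, radius=1):
    """Cumulative text share over a window of lines centred on each line.

    The window covers `radius` lines before and after the current one
    and is simply cut off at the start and end of the document.
    """
    shares = []
    for i in range(len(counts)):
        window = counts[max(0, i - radius):i + radius + 1]
        text = sum(pair[0] for pair in window)
        markup = sum(pair[1] for pair in window)
        shares.append(text_share(text, markup))
    return shares
```

The windowed variant smooths over a single markup-heavy line inside a
run of text, which a strict per-line threshold would remove.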

Maybe I'll give paragraph detection based on ,  etc. another
try, but we actually decided against that a while ago because we lost
huge amounts of valuable input due to non-use or very creative use of
such elements in actual web pages.

> That said, I believe that DIHtmlParser should take care of most of your needs:

Yes, that looks perfect. I wouldn't even have a problem with the license
or with paying for it, and I even still have D7. However, my program has
to run on our Debian 64-bit servers.

Regards
Roland


Re: [fpc-pascal] Re: stripping HTML

2011-04-17 Thread Felipe Monteiro de Carvalho
2011/4/17 Roland Schäfer:
> Yes, that looks perfect. I wouldn't even have a problem with the license
> or with paying for it, and I even still have D7. However, my program has
> to run on our Debian 64-bit servers.

You could contact the authors and say that you would like to buy a
license if it works with FPC on linux-x86-64.

-- 
Felipe Monteiro de Carvalho


Re: [fpc-pascal] Re: stripping HTML

2011-04-17 Thread Ralf Junker
On 17.04.2011 16:30, Felipe Monteiro de Carvalho wrote:

>> Yes, that looks perfect. I wouldn't even have a problem with the license
>> or with paying for it, and I even still have D7. However, my program has
>> to run on our Debian 64-bit servers.
>
> You could contact the authors and say that you would like to buy a
> license if it works in FPC linux-x86-64

I am the author of DIHtmlParser.

I do not know if DIHtmlParser compiles and works in FPC linux-x86-64
because I do not have that environment available for testing.
Unfortunately, low demand for that platform does not justify setting it
up and supporting it on a regular basis.

I can say, however, that the source code of the latest version is Pascal
only and compiles on FPC Win32 without platform warnings. But I do suspect
that it will need a few IFDEFs to make it Linux compatible. If so, I
would of course be glad to add any to the code so they will be available
in future versions.

However, having read Roland's more detailed, off-topic requirements
description, I'd rather suggest that they come up with their own HTML
parser and text filter. It sounds too specific to me to be handled by
any standard component already available.

Ralf