Boris, you might wanna look at http://code.google.com/p/boilerpipe/
simon
On Mon, Jun 28, 2010 at 10:48 PM, Boris Aleksandrovsky
wrote:
> Thanks, Sashi, I am asking more about a general library which will remove
> those HTML elements which are unwanted/useless for indexing. For instance, we
> are
Thanks, Sashi. I am asking more about a general library which will remove
those HTML elements which are unwanted or useless for indexing. For instance, we
are using a general method to remove headers by comparing the structure of
HTML on the top-level document from the site (e.g. www.nytimes.com) and t
I have used TagSoup to parse the HTML and get the elements of interest.
http://ccil.org/~cowan/XML/tagsoup/
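
To give a rough idea of what "get the elements of interest" can look like, here is a minimal sketch. It does not use TagSoup; it uses Python's standard html.parser as a stand-in. The SKIP_TAGS list and the extract_text helper are my own illustrative assumptions, not part of any library, and real pages with unclosed tags would need a more forgiving parser.

```python
from html.parser import HTMLParser

# Assumed list of elements that rarely carry indexable content; tune per site.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any element in SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Hypothetical helper: return the text left after dropping SKIP_TAGS."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><header>Site menu</header>"
        "<p>Story text</p><footer>Copyright</footer></body></html>"
    )
    print(extract_text(page))
```

A fixed tag list like this only catches boilerplate that the site marks up with those elements, so it is a starting point rather than a general solution.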
On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky
wrote:
> I was wondering if any of you know of any open-source solutions for general
> issues which arise in web crawling - how do yo