Re: header/footer identification and general scaping tools

2010-06-28 Thread Simon Willnauer

Boris, you might wanna look at http://code.google.com/p/boilerpipe/

simon

On Mon, Jun 28, 2010 at 10:48 PM, Boris Aleksandrovsky wrote:
> Thanks, Shashi, I am asking more about a general library which will remove
> those HTML elements which are unwanted/useless for indexing. For instance, we
> are

Re: header/footer identification and general scaping tools

2010-06-28 Thread Boris Aleksandrovsky

Thanks, Shashi, I am asking more about a general library which will remove those HTML elements which are unwanted/useless for indexing. For instance, we are using a general method to remove headers by comparing the structure of HTML on the top-level document from the site (e.g. www.nytimes.com) and t
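
The message above is cut off before it says what the top-level document is compared against. One plausible reading, and it is only an assumption here, is that any block appearing in the same structural position on both the site's home page and an inner page is treated as header/footer chrome. A minimal sketch of that idea using Python's standard `html.parser`; the names `BlockCollector`, `extract_blocks`, `strip_shared_blocks` and the sample pages are all hypothetical, not taken from the thread:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the path.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockCollector(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text run."""

    def __init__(self):
        super().__init__()
        self.path = []    # stack of currently open tags
        self.blocks = []  # list of (path string, normalized text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unbalanced markup: unwind to the matching open tag, if any.
        if tag in self.path:
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.blocks.append(("/".join(self.path), text))


def extract_blocks(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


def strip_shared_blocks(page_html, reference_html):
    """Return the page's text blocks that do not also occur, at the same
    tag path, in the reference page (assumed here to be the home page)."""
    shared = set(extract_blocks(reference_html))
    return [text for path, text in extract_blocks(page_html)
            if (path, text) not in shared]


# Hypothetical sample pages sharing a header and a footer.
HOME_PAGE = ("<html><body><div id='hdr'><a href='/'>Example News</a></div>"
             "<p>Top stories today</p>"
             "<div id='ftr'>Copyright Example</div></body></html>")
ARTICLE_PAGE = ("<html><body><div id='hdr'><a href='/'>Example News</a></div>"
                "<p>Crawler tips for indexing</p>"
                "<div id='ftr'>Copyright Example</div></body></html>")

if __name__ == "__main__":
    print(strip_shared_blocks(ARTICLE_PAGE, HOME_PAGE))
```

Exact matching on path plus text is the simplest possible comparison; a real crawler would likely need fuzzier matching (for example, ignoring dates in the header) and more than one reference page.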

Re: header/footer identification and general scaping tools

2010-06-28 Thread Shashi Kant

I have used TagSoup to parse the HTML and get the elements of interest.
http://ccil.org/~cowan/XML/tagsoup/

On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky wrote:
> I was wondering if any of you know of any open-source solutions for general
> issues which arise in web crawling - how do yo
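
TagSoup is a Java library, and this is not its API. As a rough illustration of the same approach (parse messy HTML leniently, then keep only the elements of interest), here is a sketch using Python's standard `html.parser` as a stand-in. Which elements count as "of interest" is an assumption (`title` and `p` here), and `ElementTextExtractor` and `extract_elements` are hypothetical names:

```python
from html.parser import HTMLParser


class ElementTextExtractor(HTMLParser):
    """Collects the text of selected elements, tolerating unclosed tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # wanted tag we are currently inside, if any
        self.buffer = []     # text pieces seen inside that tag
        self.results = []    # list of (tag, text)

    def _flush(self):
        if self.current is not None:
            text = " ".join("".join(self.buffer).split())
            if text:
                self.results.append((self.current, text))
        self.current = None
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            # An unclosed <p> ends when the next wanted element starts.
            self._flush()
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self._flush()

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit a trailing element that was never closed


def extract_elements(html, wanted=("title", "p")):
    parser = ElementTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical malformed input: unclosed <b> and <p>, no </html>.
MESSY = ("<html><head><title>Crawl notes</title></head><body>"
         "<div>menu</div><p>First <b>para<p>Second para</body>")

if __name__ == "__main__":
    for tag, text in extract_elements(MESSY):
        print(tag, "->", text)
```

Text outside the wanted elements (the `div` menu above) is dropped, which is the part relevant to keeping navigation out of an index.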