Re: header/footer identification and general scraping tools

2010-06-28 Thread Boris Aleksandrovsky
rse the HTML and get the elements of interest. > http://ccil.org/~cowan/XML/tagsoup/ > > > > On Mon, Jun 28, 2010 at 4:06 PM, Boris Aleksandrovsky > wrote: > > I was wondering if any of you know of any open-source solutions for > general > > issues which arise

header/footer identification and general scraping tools

2010-06-28 Thread Boris Aleksandrovsky
I was wondering if any of you know of any open-source solutions for general issues which arise in web crawling: how do you remove headers, footers, and JavaScript, and generally clean up the HTML of a web page before indexing? We have a first-pass solution implemented using custom code, but this must be a pro

Re: Free software for language detection

2009-03-27 Thread Boris Aleksandrovsky
Lisheng, You might want to look at the Nutch LanguageID plugin (http://wiki.apache.org/nutch/LanguageIdentifier) too. Cheers, Boris On Fri, Mar 27, 2009 at 10:22 AM, Zhang, Lisheng wrote: > Thanks very much! > > -Original Message- > From: jochen.sc...@gmail.com [mailto:jochen.sc...@gmai

Re: word frequency list?

2006-08-31 Thread Boris Aleksandrovsky
Jason, You can look here: http://www.cs.ualberta.ca/~lindek/downloads.htm for word frequency counts from a 1.5B word corpus (TREC disks 1-5 and the Reuters corpus). The words are normalized as follows: ALL CAP words are prepended with a_