Re: indexing TREC

Ian Soboroff Sun, 08 May 2005 19:39:40 -0700

Quoting Erik Hatcher <[EMAIL PROTECTED]>:

> 
> On May 7, 2005, at 10:24 PM, [EMAIL PROTECTED] wrote:
> 
> > Hi -
> >
> > Is there any TREC parser (indexing many documents that are in the  
> > same file)for Lucene avaiable?
> 
> Not to my knowledge.  I've been working on indexing some TREC data  
> myself, with some (not written by me) code to do simplistic grabbing  
> of a couple of fields from the SGML.
> 
> If there is something handy to parse these files, I'd love to give it  
> a spin though.


We at NIST have code to do this, integrated with Lucene.  We have a simple
SGML/XML parser and a simple reader which walks the documents... that's most of
what you need.  You could probably hack it together with NekoHTML in a day or
two, actually.  It can index TREC newswire corpora in a couple hours on a pretty
stock Dell workstation, and the TREC "Terabyte" collection (actually 480GB, 25
million web pages) in a weekend on a handful of workstations indexing in 
parallel.

It's not quite ready for prime-time release yet, though.  I hope to have some
time this summer to clean it up enough that we're not embarrassed to give it out
;-).  If your need is desperate, drop me an email and I'll see what I can do.

Ian



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: indexing TREC

Reply via email to