Quoting Erik Hatcher <[EMAIL PROTECTED]>: > > On May 7, 2005, at 10:24 PM, [EMAIL PROTECTED] wrote: > > > Hi - > > > > Is there any TREC parser (indexing many documents that are in the > > same file)for Lucene avaiable? > > Not to my knowledge. I've been working on indexing some TREC data > myself, with some (not written by me) code to do simplistic grabbing > of a couple of fields from the SGML. > > If there is something handy to parse these files, I'd love to give it > a spin though.
We at NIST have code to do this, integrated with Lucene. We have a simple SGML/XML parser and a simple reader which walks the documents... that's most of what you need. You could probably hack it together with NekoHTML in a day or two, actually. It can index TREC newswire corpora in a couple hours on a pretty stock Dell workstation, and the TREC "Terabyte" collection (actually 480GB, 25 million web pages) in a weekend on a handful of workstations indexing in parallel. It's not quite ready for prime-time release yet, though. I hope to have some time this summer to clean it up enough that we're not embarrassed to give it out ;-). If your need is desperate, drop me an email and I'll see what I can do. Ian --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]