Hi, I am not sure whether there is anything in contrib specifically for GOV2, but it really depends on what you want to parse. If you are only interested in full-text search, it should be much like parsing a regular document, as long as you handle the TREC-specific delimiters such as <DOC>. However, if you want structured search with separate indexes over fields such as titles, that will require some customisation. Likewise, if you want to store the anchor text separately and perform some sort of link resolution and page ranking, you will again need to customise your parsing.
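To give a rough idea of the full-text case, here is a minimal sketch in plain Python (no Lucene involved) that splits a TREC-style file on its delimiters. It assumes each record is wrapped in <DOC>...</DOC>, carries a <DOCNO>, and has an optional <DOCHDR> block before the HTML, which is how I understand GOV2 to be laid out. The names TrecDoc and parse_trec are made up for illustration, and the regex-based tag stripping is a crude stand-in for a real HTML parser.

```python
import re
from typing import Iterator, NamedTuple


class TrecDoc(NamedTuple):
    """One parsed record (hypothetical container, not a Lucene class)."""
    docno: str
    title: str
    body: str


# Assumed TREC web-collection delimiters; adjust if your files differ.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
DOCHDR_RE = re.compile(r"<DOCHDR>.*?</DOCHDR>", re.DOTALL)
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def parse_trec(text: str) -> Iterator[TrecDoc]:
    """Yield one TrecDoc per <DOC>...</DOC> record found in text."""
    for match in DOC_RE.finditer(text):
        raw = match.group(1)
        docno_match = DOCNO_RE.search(raw)
        if docno_match is None:
            continue  # skip records without an identifier
        # Drop the URL/HTTP header block so it is not indexed as content.
        content = DOCHDR_RE.sub("", raw[docno_match.end():])
        title_match = TITLE_RE.search(content)
        title = " ".join(title_match.group(1).split()) if title_match else ""
        # Crude tag stripping and whitespace collapsing for the full text.
        body = " ".join(TAG_RE.sub(" ", content).split())
        yield TrecDoc(docno_match.group(1), title, body)
```

Each record could then be mapped to one Lucene document, with docno, title and body as separate fields if you want the structured search described above. Anchor text and link resolution would need an extra pass over the <a href> elements, which this sketch does not attempt.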
h.

> Hi All,
>
> I am working on a project on Static Index pruning and I am using the TREC
> GOV2 database. I have seen that the Trec data can be parsed and the
> necessary java files are present in the contrib package, but has any user
> used Lucene to index the GOV2 dataset or is there source code available for
> the same?
>
> Regards
> Jake Dsouza