Well, depending upon your storage requirements, it's actually much easier than that. Assuming you're adding this field (or a duplicate) as UN_TOKENIZED (in this case, no need to store), you can just spin through all the terms for that field with TermDocs/TermEnum. The trick is to have your term start with a value of "". I.e. new Term(field, "") to enumerate them all. See TermDocs.seek.
This is without TermVectors at all, which'll save you some space. Erick On 3/20/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
Thanks, I see what you are saying. Seems that if I create the field at index time with term vectors stored, then I can iterate through the documents and get both the unique identifier and the terms, right? My original question was imprecise in that I'm going to want to get all the terms for *all* the documents (one document at a time) so I can just iterate through all the documents using for (int i=0; i<indexReaderR.numDocs(); i++) { TermFreqVector tfv = indexReaderR.getTermFreqVector(i,"my text field name"); Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914) 945-2472 http://www.research.ibm.com/people/g/donnagresh [EMAIL PROTECTED] "Erick Erickson" <[EMAIL PROTECTED]> 03/20/2007 03:08 PM Please respond to java-user@lucene.apache.org To java-user@lucene.apache.org cc Subject Re: Obtaining the (indexed) terms in a field in a particular document Sorry, but you have to have the Lucene document ID, which you can get either as part of a Hits or HitCollector or... or by using TermDocs/TermEnum on your unique id (my_id in your example). Erick On 3/20/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > > You can do a document.get(field), *assuming* you have stored the data > (Field.Store.YES) at index time, although you may not get > stop words. > > On 3/20/07, Donna L Gresh <[EMAIL PROTECTED]> wrote: > > > > My apologies if this is a simple question-- > > > > How can I get all the (stemmed and stop words removed, etc.) terms in a > > particular field of a particular document? > > > > Suppose my documents each consist of two fields, one with the name > > "my_id" > > and a unique identifier, and the other being some text string consisting > > of a number of words. > > I'd like to get all the terms in the text string given the unique > > identifier. > > > > (My basic reason is to do a sort of document similarity between the text > > > > string and some other text string, doing a boolean query with > > a number of SHOULD clauses, if this makes sense; I'm welcome to > > suggestions of better ways to do this) > > > > Donna L. Gresh > > > >