Re: Obtaining the (indexed) terms in a field in a particular document

Erick Erickson Tue, 20 Mar 2007 12:45:10 -0800

Well, depending upon your storage requirements, it's actually much easier
than that. Assuming you're adding this field (or a duplicate) as
UN_TOKENIZED (in this case, no need to store), you can just spin through
all the terms for that field with TermDocs/TermEnum. The trick is to start
from a term whose value is "", i.e. new Term(field, ""), to enumerate them
all. See TermDocs.seek.


This is without TermVectors at all, which'll save you some space.
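
A rough sketch of why the empty-string trick works, assuming (as in Lucene's
term dictionary) that terms are kept sorted by field name first and then by
term text. This is a plain-Java toy model, not the Lucene API: the
TermSeekSketch class, its Term pair, termsForField and sampleDictionary are
all hypothetical names made up for illustration. In Lucene itself the
starting point would be something like IndexReader.terms(new Term(field, "")),
stopping as soon as the enumerated term belongs to a different field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only (hypothetical names, NOT the Lucene API): a sorted term
// dictionary in which "seeking" to (field, "") lands on the field's first term.
class TermSeekSketch {

    // A term is a (field, text) pair, ordered by field first, then by text.
    static final class Term implements Comparable<Term> {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public int compareTo(Term other) {
            int byField = field.compareTo(other.field);
            return byField != 0 ? byField : text.compareTo(other.text);
        }
    }

    // Collects the text of every term in the given field.
    static List<String> termsForField(TreeSet<Term> dictionary, String field) {
        List<String> texts = new ArrayList<String>();
        // "" sorts before any other string, so (field, "") is at or before
        // the first real term of that field.
        for (Term t : dictionary.tailSet(new Term(field, ""), true)) {
            if (!t.field.equals(field)) {
                break; // ran past the last term of this field
            }
            texts.add(t.text);
        }
        return texts;
    }

    // Made-up sample data covering three fields.
    static TreeSet<Term> sampleDictionary() {
        TreeSet<Term> dictionary = new TreeSet<Term>();
        dictionary.add(new Term("body", "zebra"));
        dictionary.add(new Term("my_id", "doc-2"));
        dictionary.add(new Term("body", "apple"));
        dictionary.add(new Term("title", "hello"));
        dictionary.add(new Term("my_id", "doc-1"));
        return dictionary;
    }

    public static void main(String[] args) {
        // prints [doc-1, doc-2]
        System.out.println(termsForField(sampleDictionary(), "my_id"));
    }
}
```

The check on the field name matters: without it the loop would keep going
into the terms of whatever field sorts next.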

Erick

On 3/20/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:

Thanks, I see what you are saying.

Seems that if I create the field at index time with term vectors stored,
then I can iterate through the documents and get both the unique
identifier and the terms, right? My original question was imprecise in
that I'm going to want to get all the terms for *all* the documents (one
document at a time) so I can just iterate through all the documents using

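                for (int i = 0; i < indexReaderR.maxDoc(); i++) {
                        // doc IDs run up to maxDoc(); numDocs() excludes deleted docs
                        if (indexReaderR.isDeleted(i)) continue;
                        // may be null if no term vector was stored for this field
                        TermFreqVector tfv = indexReaderR.getTermFreqVector(i, "my text field name");
                }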

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
[EMAIL PROTECTED]

"Erick Erickson" <[EMAIL PROTECTED]>
Date: 03/20/2007 03:08 PM
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: Obtaining the (indexed) terms in a field in a particular document

Sorry, but you have to have the Lucene document ID, which you
can get either from a Hits or HitCollector, or by using
TermDocs/TermEnum on your unique id (my_id in your example).

Erick

On 3/20/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> You can do a document.get(field), *assuming* you have stored the data
> (Field.Store.YES) at index time, although that returns the original
> stored text, so it is not stemmed and still includes stop words.
>
> On 3/20/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
> >
> > My apologies if this is a simple question--
> >
> > How can I get all the (stemmed and stop words removed, etc.) terms in a
> > particular field of a particular document?
> >
> > Suppose my documents each consist of two fields, one with the name
> > "my_id" and a unique identifier, and the other being some text string
> > consisting of a number of words.
> > I'd like to get all the terms in the text string given the unique
> > identifier.
> >
> > (My basic reason is to do a sort of document similarity between the text
> > string and some other text string, doing a boolean query with
> > a number of SHOULD clauses, if this makes sense; I welcome
> > suggestions of better ways to do this)
> >
> > Donna L. Gresh
> >
>
>
