Thanks very much for your reply, Ian. I am using SlowCompositeReaderWrapper because I am also retrieving term frequency statistics for the whole corpus (at the end of the day, I am doing some machine learning/document clustering). Despite its name and the documentation's warnings against using it, SlowCompositeReaderWrapper seems to be the only baked-in way of getting total corpus term statistics from a DirectoryReader, n'est-ce pas? Incidentally, I am using the StandardAnalyzer as well.
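For reference, here is a rough sketch of how I am pulling the corpus-wide term statistics (assuming Lucene 4.0, the "body" field from the code quoted below, and an index path passed on the command line; the class name CorpusTermStats is just for illustration):

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class CorpusTermStats {
  public static void main(String[] args) throws Exception {
    DirectoryReader dirRdr = DirectoryReader.open(FSDirectory.open(new File(args[0])));
    // Wrap the composite reader so the whole index looks like a single
    // segment and the per-term statistics are merged across segments.
    SlowCompositeReaderWrapper rdr = new SlowCompositeReaderWrapper(dirRdr);

    Terms terms = rdr.terms("body");
    if (terms != null) {
      TermsEnum te = terms.iterator(null);
      BytesRef term;
      while ((term = te.next()) != null) {
        // docFreq = number of documents containing the term;
        // totalTermFreq = total occurrences of the term in the corpus.
        System.out.println(term.utf8ToString()
            + "\tdocFreq=" + te.docFreq()
            + "\ttotalTermFreq=" + te.totalTermFreq());
      }
    }
    dirRdr.close();
  }
}

That gives me the merged docFreq/totalTermFreq numbers I need for the clustering features, though I would be glad to hear of a cheaper way to get them than wrapping the whole DirectoryReader.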
Jon

On Thu, Jan 17, 2013 at 5:06 AM, Ian Lea <ian....@gmail.com> wrote:
> When I run your code, as is except for using RAMDirectory and setting
> up an IndexWriter using StandardAnalyzer
>
> RAMDirectory dir = new RAMDirectory();
> Analyzer anl = new StandardAnalyzer(Version.LUCENE_40);
> IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_40, anl);
> IndexWriter iw = new IndexWriter(dir, iwcfg);
> ...
> iw.addDocument(doc);
> iw.close();
>
> it prints
>
> doc 0 had 1 terms.
>
> If I change the text to e.g. "this is foobar gibberish" it says there are 2
> terms. So it looks OK to me. "this" and "is" are presumably in the
> default list of stop words.
>
> Not relevant, but why are you using SlowCompositeReaderWrapper rather than
> just IndexReader rdr = DirectoryReader.open(dir)? I get the same results
> either way.
>
>
> --
> Ian.
>
>
> On Thu, Jan 17, 2013 at 5:52 AM, Jon Stewart
> <j...@lightboxtechnologies.com> wrote:
>> Hello,
>>
>> I cannot extract document term vectors from an index, and have not
>> turned up much in some determined googling. In short, when I call
>> IndexReader.getTermVector(docID, field) or
>> IndexReader.getTermVectors(docID) and then navigate down to the Terms
>> for the specified field, I get a null result.
>>
>> // Indexing:
>> String bodyText = "this is foobar";
>> final FieldType BodyOptions = new FieldType();
>> BodyOptions.setIndexed(true);
>> BodyOptions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>> BodyOptions.setStored(true);
>> BodyOptions.setStoreTermVectors(true);
>> BodyOptions.setTokenized(true);
>> Document doc = new Document();
>> doc.add(new Field("body", bodyText, BodyOptions));
>>
>> When I examine docs in Luke, I can see the term vectors.
>>
>> // Retrieving (at a later time)
>> DirectoryReader dirRdr = DirectoryReader.open(FSDirectory.open(new File(path)));
>> SlowCompositeReaderWrapper rdr = new SlowCompositeReaderWrapper(dirRdr);
>> for (int i = 0; i < rdr.maxDoc(); ++i) {
>>   int numTerms = 0;
>>   Terms terms = rdr.getTermVector(i, "body");
>>   if (terms != null) {
>>     TermsEnum term = terms.iterator(null);
>>     while (term.next() != null) {
>>       ++numTerms;
>>     }
>>     System.out.println("doc " + i + " had " + numTerms + " terms");
>>   } else {
>>     System.err.println("null term vector on doc " + i);
>>   }
>> }
>>
>> On every doc, the Terms object I get back from getTermVector(i, "body")
>> is null.
>>
>>
>> Jon
>> --
>> Jon Stewart, Principal
>> (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA

--
Jon Stewart, Principal
(646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org