Re: Document term vectors in Lucene 4

Jon Stewart Fri, 18 Jan 2013 07:46:09 -0800

Thanks! I still can't see what was wrong with my original code--must
have been a dumb typo somewhere--but starting over from that example
now works on indices generated from my real indexing code. I will try
to blog about it next week so there is some sample code up on the web
for anyone else searching for how to do something similar.


I did not know about MultiFields, but yes, that seems to get rid of
the need for the SlowCompositeReaderWrapper. I really doubt
SlowCompositeReaderWrapper would be all that slow for my purposes,
though--I care more about indexing speed than ultra-fast query
responses. With multithreaded indexing, Lucene 4 seems to be able to
index files about as fast as I can read them in from disk, even
including Tika text extraction. Kudos.


Jon

On Fri, Jan 18, 2013 at 6:12 AM, Ian Lea <ian....@gmail.com> wrote:
> To get stats from the whole index I think you need to come at this
> from a different direction.  See the 4.0 migration guide for some
> details.
>
> With a variation on your code and 2 docs
>
> doc1: foobar qux quote
> doc2: foobar qux qux quorum
>
> this code snippet
>
>         Fields fields = MultiFields.getFields(rdr);
>         Terms terms = fields.terms("body");
>         TermsEnum te = terms.iterator(null);
>         while (te.next() != null) {
>             String tt = te.term().utf8ToString();
>             System.out.printf("%s totalFreq()=%s, docFreq=%s\n",
>                               tt,
>                               te.totalTermFreq(),
>                               te.docFreq());
>         }
>
> displays
>
> foobar totalFreq()=2, docFreq=2
> quorum totalFreq()=1, docFreq=1
> quote totalFreq()=1, docFreq=1
> qux totalFreq()=3, docFreq=2
>
> This is with a standard IndexReader as returned by
> DirectoryReader.open(dir), on a RAMDirectory with 2 docs so there
> won't be many segments.  But from my reading of the migration guide
> you shouldn't need to use the Composite reader.
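
To make the two statistics in Ian's example concrete, here is a minimal plain-Java sketch (no Lucene involved) that recomputes them for the same two documents. It assumes simple whitespace tokenization, and the names CorpusStatsSketch, Stats, and collect are made up for illustration. The total frequency counts every occurrence of a term across the corpus, while the document frequency counts how many documents contain the term at least once.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java illustration only: no Lucene classes are used here.
final class CorpusStatsSketch {

    /** Holds the two corpus-wide counts for one term. */
    static final class Stats {
        long totalTermFreq; // occurrences summed over all documents
        int docFreq;        // number of documents containing the term
    }

    /** Computes per-term stats over whitespace-tokenized documents, sorted by term. */
    static Map<String, Stats> collect(List<String> docs) {
        Map<String, Stats> stats = new TreeMap<>();
        for (String doc : docs) {
            // Tracks which terms were already counted toward docFreq for this document.
            Set<String> seenInDoc = new HashSet<>();
            for (String token : doc.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                Stats s = stats.computeIfAbsent(token, k -> new Stats());
                s.totalTermFreq++;
                if (seenInDoc.add(token)) {
                    s.docFreq++;
                }
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("foobar qux quote", "foobar qux qux quorum");
        for (Map.Entry<String, Stats> e : collect(docs).entrySet()) {
            System.out.printf("%s totalFreq()=%s, docFreq=%s%n",
                              e.getKey(), e.getValue().totalTermFreq, e.getValue().docFreq);
        }
    }
}
```

Running main on the two example documents should print the same four lines as the output quoted above, which is a quick way to check one's reading of the two counts.
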
>
>
> Hope this helps - we are getting outside my area of expertise so don't
> trust anything I say.
>
>
> --
> Ian.
>
> On Thu, Jan 17, 2013 at 3:11 PM, Jon Stewart
> <j...@lightboxtechnologies.com> wrote:
>> D'oh!!!! Thanks!
>>
>> Does TermsEnum.totalTermFreq() return the per-doc frequencies? It
>> looks like it empirically, but the documentation refers to corpus
>> usage, not document.field usage.
>>
>> Jon
>>
>> On Thu, Jan 17, 2013 at 10:00 AM, Ian Lea <ian....@gmail.com> wrote:
>>> Typo time.  You need doc2.add(...) for the second document, not two
>>> doc.add(...) statements.
>>>
>>>
>>> --
>>> Ian.
>>>
>>>
>>> On Thu, Jan 17, 2013 at 2:49 PM, Jon Stewart
>>> <j...@lightboxtechnologies.com> wrote:
>>>> On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir <rcm...@gmail.com> wrote:
>>>>> Which statistics in particular (which methods)?
>>>>
>>>> I'd like to know the frequency of each term in each document. Those
>>>> term counts for the most frequent terms in the corpus will make it
>>>> into the document vectors for clustering.
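
As a rough picture of the goal described here, the following plain-Java sketch (again no Lucene involved) counts terms per document, picks the k most frequent terms across the corpus, and builds one count vector per document over those terms. Whitespace tokenization, the alphabetical tie-break, and the names DocVectorSketch, termCounts, topTerms, and vectors are all assumptions for illustration. In a real index the per-document counts would come from the term vectors rather than from re-tokenizing the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only: no Lucene classes are used here.
final class DocVectorSketch {

    /** Counts how often each whitespace-separated token occurs in one document. */
    static Map<String, Integer> termCounts(String doc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : doc.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the k terms with the highest corpus-wide count, ties broken alphabetically. */
    static List<String> topTerms(List<Map<String, Integer>> perDoc, int k) {
        Map<String, Integer> corpus = new HashMap<>();
        for (Map<String, Integer> counts : perDoc) {
            counts.forEach((term, n) -> corpus.merge(term, n, Integer::sum));
        }
        List<String> terms = new ArrayList<>(corpus.keySet());
        terms.sort((a, b) -> {
            int byCount = Integer.compare(corpus.get(b), corpus.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }

    /** Builds one count vector per document, with one dimension per selected term. */
    static int[][] vectors(List<String> docs, int k) {
        List<Map<String, Integer>> perDoc = new ArrayList<>();
        for (String doc : docs) {
            perDoc.add(termCounts(doc));
        }
        List<String> dims = topTerms(perDoc, k);
        int[][] out = new int[docs.size()][dims.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (int t = 0; t < dims.size(); t++) {
                out[d][t] = perDoc.get(d).getOrDefault(dims.get(t), 0);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("foobar qux quote", "foobar qux qux quorum");
        System.out.println(Arrays.deepToString(vectors(docs, 2)));
    }
}
```

Restricting the dimensions to the top k corpus terms keeps every document vector the same length, which is what most clustering code expects.
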
>>>>
>>>> Looking at Terms and TermsEnum, I'm actually somewhat baffled about
>>>> how to do this. Iterating over the TermsEnums in a Terms retrieved by
>>>> IndexReader.getTermVector() will tell me about the presence of a term
>>>> within a document, but I don't see a simple "count" or "freq" method
>>>> in TermsEnum--the methods there look like corpus statistics.
>>>>
>>>> Based on Ian's reply, I created the following one-file test program.
>>>> The results I get are weird: I get a term vector back for the first
>>>> document, but not for the second.
>>>>
>>>> Output:
>>>> doc 0 had term 'baz'
>>>> doc 0 had term 'foobar'
>>>> doc 0 had term 'gibberish'
>>>> doc 0 had 3 terms
>>>> doc 1 had no term vector for body
>>>>
>>>> Thanks again for the responses and assistance.
>>>>
>>>>
>>>> Jon
>>>>
>>>>
>>>> import java.io.File;
>>>> import java.io.IOException;
>>>>
>>>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>>>
>>>> import org.apache.lucene.index.IndexWriter;
>>>> import org.apache.lucene.index.IndexWriterConfig.OpenMode;
>>>> import org.apache.lucene.index.IndexWriterConfig;
>>>> import org.apache.lucene.index.FieldInfo.IndexOptions;
>>>> import org.apache.lucene.index.CorruptIndexException;
>>>> import org.apache.lucene.index.AtomicReader;
>>>> import org.apache.lucene.index.IndexableField;
>>>> import org.apache.lucene.index.Terms;
>>>> import org.apache.lucene.index.TermsEnum;
>>>> import org.apache.lucene.index.SlowCompositeReaderWrapper;
>>>> import org.apache.lucene.index.DirectoryReader;
>>>>
>>>> import org.apache.lucene.store.Directory;
>>>> import org.apache.lucene.store.FSDirectory;
>>>>
>>>> import org.apache.lucene.util.BytesRef;
>>>> import org.apache.lucene.util.Version;
>>>>
>>>> import org.apache.lucene.document.Document;
>>>> import org.apache.lucene.document.Field;
>>>> import org.apache.lucene.document.StringField;
>>>> import org.apache.lucene.document.FieldType;
>>>>
>>>> public class LuceneTest {
>>>>
>>>>   static void createIndex(final String path) throws IOException, CorruptIndexException {
>>>>     final Directory dir = FSDirectory.open(new File(path));
>>>>     final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
>>>>     final IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
>>>>     iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>>>     iwc.setRAMBufferSizeMB(256.0);
>>>>     final IndexWriter writer = new IndexWriter(dir, iwc);
>>>>
>>>>     final FieldType bodyOptions = new FieldType();
>>>>     bodyOptions.setIndexed(true);
>>>>     bodyOptions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>>>>     bodyOptions.setStored(true);
>>>>     bodyOptions.setStoreTermVectors(true);
>>>>     bodyOptions.setTokenized(true);
>>>>
>>>>     final Document doc = new Document();
>>>>     doc.add(new Field("body", "this foobar is gibberish, baz", bodyOptions));
>>>>     writer.addDocument(doc);
>>>>
>>>>     final Document doc2 = new Document();
>>>>     doc.add(new Field("body",
>>>>         "I don't know what to tell you, qux. Some foobar is just fubar.",
>>>>         bodyOptions));
>>>>     writer.addDocument(doc2);
>>>>
>>>>     writer.close();
>>>>   }
>>>>
>>>>   static void readIndex(final String path) throws IOException, CorruptIndexException {
>>>>     final DirectoryReader dirReader = DirectoryReader.open(FSDirectory.open(new File(path)));
>>>>     final SlowCompositeReaderWrapper rdr = new SlowCompositeReaderWrapper(dirReader);
>>>>
>>>>     int max = rdr.maxDoc();
>>>>
>>>>     TermsEnum term = null;
>>>>     // iterate docs
>>>>     for (int i = 0; i < max; ++i) {
>>>>       // get term vector for body field
>>>>       final Terms terms = rdr.getTermVector(i, "body");
>>>>       if (terms != null) {
>>>>         // count terms in doc
>>>>         int numTerms = 0;
>>>>         term = terms.iterator(term);
>>>>         while (term.next() != null) {
>>>>           System.out.println("doc " + i + " had term '" + term.term().utf8ToString() + "'");
>>>>           ++numTerms;
>>>>
>>>>           // would like to record doc term frequencies here, i.e.,
>>>>           // counts[i][term.term()] = term.freq()
>>>>         }
>>>>         System.out.println("doc " + i + " had " + numTerms + " terms");
>>>>       }
>>>>       else {
>>>>         System.err.println("doc " + i + " had no term vector for body");
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>>   public static void main(String[] args) throws IOException, InterruptedException, CorruptIndexException {
>>>>     final String path = args[0];
>>>>     createIndex(path);
>>>>     readIndex(path);
>>>>   }
>>>> }
>>>>
>>>> --
>>>> Jon Stewart, Principal
>>>> (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Jon Stewart, Principal
>> (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
>>
>>
>
>



-- 
Jon Stewart, Principal
(646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
