D'oh!!!! Thanks! Does TermsEnum.totalTermFreq() return the per-doc frequencies? Empirically it looks like it does, but the documentation refers to corpus-wide usage, not document.field usage.
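For the archive, here is a minimal sketch of the answer as I understand it, against the Lucene 4.0 API (the helper name termFreqs is made up, and this needs lucene-core on the classpath, so treat it as a sketch rather than a drop-in). The Terms returned by getTermVector() is effectively a one-document inverted index, so on its TermsEnum the "corpus" statistics collapse to per-document ones: totalTermFreq() is the term's count within that one document, and docFreq() is always 1.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper: per-document term frequencies read from a term vector.
// Because a term vector indexes exactly one document, totalTermFreq() here is
// the count of the term within that single document.
class TermVectorFreqs {
  static Map<String, Long> termFreqs(IndexReader reader, int docID, String field)
      throws IOException {
    final Map<String, Long> freqs = new HashMap<String, Long>();
    final Terms vector = reader.getTermVector(docID, field);
    if (vector != null) {
      final TermsEnum te = vector.iterator(null);
      BytesRef term;
      while ((term = te.next()) != null) {
        freqs.put(term.utf8ToString(), te.totalTermFreq());
      }
    }
    return freqs;
  }
}
```

The same number should also be reachable by pulling the (one-document) postings off the enum -- te.docs(null, null), then nextDoc(), then freq() -- if you'd rather not rely on totalTermFreq().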
Jon

On Thu, Jan 17, 2013 at 10:00 AM, Ian Lea <ian....@gmail.com> wrote:
> typo time. You need doc2.add(...) not 2 doc.add(...) statements.
>
> --
> Ian.
>
> On Thu, Jan 17, 2013 at 2:49 PM, Jon Stewart
> <j...@lightboxtechnologies.com> wrote:
>> On Thu, Jan 17, 2013 at 9:08 AM, Robert Muir <rcm...@gmail.com> wrote:
>>> Which statistics in particular (which methods)?
>>
>> I'd like to know the frequency of each term in each document. Those
>> term counts for the most frequent terms in the corpus will make it
>> into the document vectors for clustering.
>>
>> Looking at Terms and TermsEnum, I'm actually somewhat baffled about
>> how to do this. Iterating over the TermsEnums in a Terms retrieved by
>> IndexReader.getTermVector() will tell me about the presence of a term
>> within a document, but I don't see a simple "count" or "freq" method
>> in TermsEnum -- the methods there look like corpus statistics.
>>
>> Based on Ian's reply, I created the following one-file test program.
>> The results I get are weird: I get a term vector back for the first
>> document, but not for the second.
>>
>> Output:
>> doc 0 had term 'baz'
>> doc 0 had term 'foobar'
>> doc 0 had term 'gibberish'
>> doc 0 had 3 terms
>> doc 1 had no term vector for body
>>
>> Thanks again for the responses and assistance.
>> Jon
>>
>> import java.io.File;
>> import java.io.IOException;
>>
>> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>>
>> import org.apache.lucene.index.IndexWriter;
>> import org.apache.lucene.index.IndexWriterConfig.OpenMode;
>> import org.apache.lucene.index.IndexWriterConfig;
>> import org.apache.lucene.index.FieldInfo.IndexOptions;
>> import org.apache.lucene.index.CorruptIndexException;
>> import org.apache.lucene.index.AtomicReader;
>> import org.apache.lucene.index.IndexableField;
>> import org.apache.lucene.index.Terms;
>> import org.apache.lucene.index.TermsEnum;
>> import org.apache.lucene.index.SlowCompositeReaderWrapper;
>> import org.apache.lucene.index.DirectoryReader;
>>
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.FSDirectory;
>>
>> import org.apache.lucene.util.BytesRef;
>> import org.apache.lucene.util.Version;
>>
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.document.Field;
>> import org.apache.lucene.document.StringField;
>> import org.apache.lucene.document.FieldType;
>>
>> public class LuceneTest {
>>
>>   static void createIndex(final String path)
>>       throws IOException, CorruptIndexException {
>>     final Directory dir = FSDirectory.open(new File(path));
>>     final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
>>     final IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
>>     iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
>>     iwc.setRAMBufferSizeMB(256.0);
>>     final IndexWriter writer = new IndexWriter(dir, iwc);
>>
>>     final FieldType bodyOptions = new FieldType();
>>     bodyOptions.setIndexed(true);
>>     bodyOptions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>>     bodyOptions.setStored(true);
>>     bodyOptions.setStoreTermVectors(true);
>>     bodyOptions.setTokenized(true);
>>
>>     final Document doc = new Document();
>>     doc.add(new Field("body", "this foobar is gibberish, baz", bodyOptions));
>>     writer.addDocument(doc);
>>
>>     final Document doc2 = new Document();
>>     doc.add(new Field("body", "I don't know what to tell you, qux. Some foobar is just fubar.", bodyOptions));
>>     writer.addDocument(doc2);
>>
>>     writer.close();
>>   }
>>
>>   static void readIndex(final String path)
>>       throws IOException, CorruptIndexException {
>>     final DirectoryReader dirReader =
>>         DirectoryReader.open(FSDirectory.open(new File(path)));
>>     final SlowCompositeReaderWrapper rdr =
>>         new SlowCompositeReaderWrapper(dirReader);
>>
>>     int max = rdr.maxDoc();
>>
>>     TermsEnum term = null;
>>     // iterate docs
>>     for (int i = 0; i < max; ++i) {
>>       // get term vector for body field
>>       final Terms terms = rdr.getTermVector(i, "body");
>>       if (terms != null) {
>>         // count terms in doc
>>         int numTerms = 0;
>>         term = terms.iterator(term);
>>         while (term.next() != null) {
>>           System.out.println("doc " + i + " had term '" + term.term().utf8ToString() + "'");
>>           ++numTerms;
>>           // would like to record doc term frequencies here, i.e.,
>>           // counts[i][term.term()] = term.freq()
>>         }
>>         System.out.println("doc " + i + " had " + numTerms + " terms");
>>       }
>>       else {
>>         System.err.println("doc " + i + " had no term vector for body");
>>       }
>>     }
>>   }
>>
>>   public static void main(String[] args)
>>       throws IOException, InterruptedException, CorruptIndexException {
>>     final String path = args[0];
>>     createIndex(path);
>>     readIndex(path);
>>   }
>> }
>>
>> --
>> Jon Stewart, Principal
>> (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Jon Stewart, Principal
(646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
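Aside from the Lucene API question, the counts[i][term] structure described above -- one term-frequency map per document, as input to the clustering vectors -- can be sketched with plain Java collections. The class and method names here are made up for illustration, and the regex tokenizer is a stand-in for the real program's StandardAnalyzer:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermCounts {
    // Build per-document term frequency maps (counts[i][term]) from
    // already-tokenized text. Illustrative only: the real program would get
    // its tokens from Lucene's StandardAnalyzer, not a regex split.
    static List<Map<String, Integer>> countTerms(String[] docs) {
        final List<Map<String, Integer>> counts = new ArrayList<>();
        for (String doc : docs) {
            final Map<String, Integer> freqs = new HashMap<>();
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    freqs.merge(token, 1, Integer::sum);  // increment, starting at 1
                }
            }
            counts.add(freqs);
        }
        return counts;
    }

    public static void main(String[] args) {
        final List<Map<String, Integer>> counts = countTerms(new String[] {
            "this foobar is gibberish, baz",
            "Some foobar is just fubar. Some foobar indeed."
        });
        System.out.println(counts.get(0).get("foobar")); // 1
        System.out.println(counts.get(1).get("foobar")); // 2
    }
}
```

Once each document has such a map, the per-document frequencies of the most frequent corpus terms can be copied straight into fixed-length feature vectors.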