Which statistics in particular (which methods)?

On Thu, Jan 17, 2013 at 5:10 AM, Jon Stewart
<j...@lightboxtechnologies.com> wrote:
> Thanks very much for your reply, Ian.
>
> I am using SlowCompositeReaderWrapper because I am also retrieving the
> term frequency statistics for the corpus (at the end of the day, I am
> doing some machine learning/document clustering). Despite its name and
> warning documentation not to use it, SlowCompositeReaderWrapper seems
> to be the only baked-in way of getting total corpus term statistics
> from a DirectoryReader, n'est-ce pas? Incidentally, I am using the
> StandardAnalyzer as well.
>
>
> Jon
>
> On Thu, Jan 17, 2013 at 5:06 AM, Ian Lea <ian....@gmail.com> wrote:
>> When I run your code, as is except for using RAMDirectory and setting
>> up an IndexWriter using StandardAnalyzer
>>
>>   RAMDirectory dir = new RAMDirectory();
>>   Analyzer anl = new StandardAnalyzer(Version.LUCENE_40);
>>   IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_40, anl);
>>   IndexWriter iw = new IndexWriter(dir, iwcfg);
>>   ...
>>   iw.addDocument(doc);
>>   iw.close();
>>
>> it prints
>>
>>   doc 0 had 1 terms.
>>
>> If I change the text to, e.g., "this is foobar gibberish" it says there
>> are 2 terms. So it looks OK to me. "this" and "is" are presumably in the
>> default list of stop words.
>>
>> Not relevant, but why are you using SlowCompositeReaderWrapper rather than
>> just IndexReader rdr = DirectoryReader.open(dir)? I get the same results
>> either way.
>>
>>
>> --
>> Ian.
>>
>>
>> On Thu, Jan 17, 2013 at 5:52 AM, Jon Stewart
>> <j...@lightboxtechnologies.com> wrote:
>>> Hello,
>>>
>>> I cannot extract document term vectors from an index, and have not
>>> turned up much in some determined googling. In short, when I call
>>> IndexReader.getTermVector(docID, field) or
>>> IndexReader.getTermVectors(docID) and then navigate down to the Terms
>>> for the specified field, I get a null result.
>>>
>>> // Indexing:
>>> String bodyText = "this is foobar";
>>> final FieldType BodyOptions = new FieldType();
>>> BodyOptions.setIndexed(true);
>>> BodyOptions.setIndexOptions(
>>>     IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>>> BodyOptions.setStored(true);
>>> BodyOptions.setStoreTermVectors(true);
>>> BodyOptions.setTokenized(true);
>>> Document doc = new Document();
>>> doc.add(new Field("body", bodyText, BodyOptions));
>>>
>>> When I examine docs in Luke, I can see the term vectors.
>>>
>>> // Retrieving (at a later time)
>>> DirectoryReader dirRdr =
>>>     DirectoryReader.open(FSDirectory.open(new File(path)));
>>> SlowCompositeReaderWrapper rdr = new SlowCompositeReaderWrapper(dirRdr);
>>> for (int i = 0; i < rdr.maxDoc(); ++i) {
>>>   int numTerms = 0;
>>>   Terms terms = rdr.getTermVector(i, "body");
>>>   if (terms != null) {
>>>     TermsEnum term = terms.iterator(null);
>>>     while (term.next() != null) {
>>>       ++numTerms;
>>>     }
>>>     System.out.println("doc " + i + " had " + numTerms + " terms");
>>>   }
>>>   else {
>>>     System.err.println("null term vector on doc " + i);
>>>   }
>>> }
>>>
>>> On every doc, the Terms object I get back from getTermVector(i, "body")
>>> is null.
>>>
>>>
>>> Jon
>>> --
>>> Jon Stewart, Principal
>>> (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
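
Ian's explanation for the "1 terms" result (that "this" and "is" are dropped as stop words) can be illustrated without Lucene at all. The sketch below is a simplified stand-in, not StandardAnalyzer itself: the class name StopWordSketch, the split-on-non-alphanumerics rule, and the short stop list are all assumptions made for illustration. It returns the distinct surviving terms in sorted order, which is roughly what counting the entries of a term vector measures.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Simplified illustration only; this is NOT Lucene's StandardAnalyzer.
class StopWordSketch {

    // Assumption: a small hand-picked stop list. The example only relies
    // on "this" and "is" being present.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "is", "of", "the", "this", "to"));

    // Lowercase the text, split on runs of non-alphanumeric characters,
    // drop stop words, and collect the distinct remaining terms in sorted order.
    static SortedSet<String> analyze(String text) {
        SortedSet<String> terms = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("this is foobar"));           // prints [foobar]
        System.out.println(analyze("this is foobar gibberish")); // prints [foobar, gibberish]
    }
}
```

With this stand-in, "this is foobar" leaves one term and "this is foobar gibberish" leaves two, matching the counts Ian reported. A real analyzer tokenizes by different rules and ships its own stop set, so treat the numbers as illustrative only.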