Thanks very much for your reply, Ian. I am using SlowCompositeReaderWrapper because I am also retrieving term frequency statistics for the whole corpus (at the end of the day, I am doing some machine learning/document clustering). Despite its name and the documentation's warnings against using it, SlowCompositeReaderWrapper seems to be the only baked-in way of getting total corpus term statistics from a DirectoryReader, n'est-ce pas? Incidentally, I am using the StandardAnalyzer as well.
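For reference, here is a rough sketch of how I am pulling the corpus-wide term statistics (assuming Lucene 4.0, the "body" field from the code quoted below, and an index path passed on the command line; the class name CorpusTermStats is just for illustration):

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class CorpusTermStats {
  public static void main(String[] args) throws Exception {
    DirectoryReader dirRdr = DirectoryReader.open(FSDirectory.open(new File(args[0])));
    // Wrap the composite reader so the whole index looks like a single
    // segment and the per-term statistics are merged across segments.
    SlowCompositeReaderWrapper rdr = new SlowCompositeReaderWrapper(dirRdr);

    Terms terms = rdr.terms("body");
    if (terms != null) {
      TermsEnum te = terms.iterator(null);
      BytesRef term;
      while ((term = te.next()) != null) {
        // docFreq = number of documents containing the term;
        // totalTermFreq = total occurrences of the term in the corpus.
        System.out.println(term.utf8ToString()
            + "\tdocFreq=" + te.docFreq()
            + "\ttotalTermFreq=" + te.totalTermFreq());
      }
    }
    dirRdr.close();
  }
}

That gives me the merged docFreq/totalTermFreq numbers I need for the clustering features, though I would be glad to hear of a cheaper way to get them than wrapping the whole DirectoryReader.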
Jon

On Thu, Jan 17, 2013 at 5:06 AM, Ian Lea <ian....@gmail.com> wrote:
> When I run your code, as is except for using RAMDirectory and setting
> up an IndexWriter using StandardAnalyzer
>
> RAMDirectory dir = new RAMDirectory();
> Analyzer anl = new StandardAnalyzer(Version.LUCENE_40);
> IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_40, anl);
> IndexWriter iw = new IndexWriter(dir, iwcfg);
> ...
> iw.addDocument(doc);
> iw.close();
>
> it prints
>
> doc 0 had 1 terms.
>
> If I change the text to e.g. "this is foobar gibberish" it says there are 2
> terms. So it looks OK to me. "this" and "is" are presumably in the
> default list of stop words.
>
> Not relevant, but why are you using SlowCompositeReaderWrapper rather than
> just IndexReader rdr = DirectoryReader.open(dir)? I get the same results
> either way.
>
>
> --
> Ian.
>
>
> On Thu, Jan 17, 2013 at 5:52 AM, Jon Stewart
> <j...@lightboxtechnologies.com> wrote:
>> Hello,
>>
>> I cannot extract document term vectors from an index, and have not
>> turned up much in some determined googling. In short, when I call
>> IndexReader.getTermVector(docID, field) or
>> IndexReader.getTermVectors(docID) and then navigate down to the Terms
>> for the specified field, I get a null result.
>>
>> // Indexing:
>> String bodyText = "this is foobar";
>> final FieldType BodyOptions = new FieldType();
>> BodyOptions.setIndexed(true);
>> BodyOptions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>> BodyOptions.setStored(true);
>> BodyOptions.setStoreTermVectors(true);
>> BodyOptions.setTokenized(true);
>> Document doc = new Document();
>> doc.add(new Field("body", bodyText, BodyOptions));
>>
>> When I examine docs in Luke, I can see the term vectors.
>>
>> // Retrieving (at a later time)
>> DirectoryReader dirRdr = DirectoryReader.open(FSDirectory.open(new File(path)));
>> SlowCompositeReaderWrapper rdr = new SlowCompositeReaderWrapper(dirRdr);
>> for (int i = 0; i < rdr.maxDoc(); ++i) {
>>   int numTerms = 0;
>>   Terms terms = rdr.getTermVector(i, "body");
>>   if (terms != null) {
>>     TermsEnum term = terms.iterator(null);
>>     while (term.next() != null) {
>>       ++numTerms;
>>     }
>>     System.out.println("doc " + i + " had " + numTerms + " terms");
>>   } else {
>>     System.err.println("null term vector on doc " + i);
>>   }
>> }
>>
>> On every doc, the Terms object I get back from getTermVector(i, "body")
>> is null.
>>
>>
>> Jon
>> --
>> Jon Stewart, Principal
>> (646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA

--
Jon Stewart, Principal
(646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org