Hi,

I was able to finally figure this out. Lucene's Benchmark utility has some default parsers for TREC datasets. I noticed that it was not parsing the title correctly for my dataset and eventually set it to null, so the title was not getting indexed even though I was asking it to be.
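For completeness, here is the shape of the fix. The custom parser wired in via trec.doc.parser in the .alg file quoted below pulls the title out of the document text itself instead of leaving it null. This is only a rough sketch of that idea, not my actual TrecParserByPath: the parse() signature is copied from the TrecDocParser variants in the benchmark module as I understand them, and the <TITLE> tags are an assumption about the dataset's markup.

import java.io.IOException;

import org.apache.lucene.benchmark.byTask.feeds.DocData;
import org.apache.lucene.benchmark.byTask.feeds.TrecContentSource;
import org.apache.lucene.benchmark.byTask.feeds.TrecDocParser;

// Sketch of a TREC parser that extracts the title itself instead of relying on
// the default parsing, which left the title null for my dataset.
public class TitleAwareTrecParser extends TrecDocParser {

  private static final String TITLE_START = "<TITLE>";   // assumed dataset markup
  private static final String TITLE_END = "</TITLE>";

  @Override
  public DocData parse(DocData docData, String name, TrecContentSource trecSrc,
                       StringBuilder docBuf, ParsePathType pathType) throws IOException {
    String raw = docBuf.toString();

    // Pull the title out of the raw document text; fall back to the doc name
    // rather than leaving the title null (a null title never gets indexed).
    String title = name;
    int start = raw.indexOf(TITLE_START);
    int end = raw.indexOf(TITLE_END);
    if (start >= 0 && end > start) {
      title = raw.substring(start + TITLE_START.length(), end).trim();
    }

    docData.clear();
    docData.setName(name);
    docData.setTitle(title);
    docData.setBody(raw);  // simplification: a real parser would also strip tags and set the date
    return docData;
  }
}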
It works well once I fixed the parser.

Regards,
Sachin Kulkarni

On Tue, Aug 19, 2014 at 9:53 PM, Sachin Kulkarni <kulk...@hawk.iit.edu> wrote:

Hi Kumaran,

See below part of the code and the .alg file.
Here is the setConfig() function from DocMaker.java, in the package org.apache.lucene.benchmark.byTask.feeds:

/** Set the configuration parameters of this doc maker. */
public void setConfig(Config config, ContentSource source) {
  this.config = config;
  this.source = source;

  boolean stored = config.get("doc.stored", false);
  boolean bodyStored = config.get("doc.body.stored", stored);
  boolean tokenized = config.get("doc.tokenized", true);
  boolean bodyTokenized = config.get("doc.body.tokenized", tokenized);
  boolean norms = config.get("doc.tokenized.norms", false);
  boolean bodyNorms = config.get("doc.body.tokenized.norms", true);
  boolean termVec = config.get("doc.term.vector", false);
  boolean termVecPositions = config.get("doc.term.vector.positions", false);
  boolean termVecOffsets = config.get("doc.term.vector.offsets", false);

  valType = new FieldType(TextField.TYPE_NOT_STORED);
  valType.setStored(stored);
  valType.setTokenized(tokenized);
  valType.setOmitNorms(!norms);
  valType.setStoreTermVectors(termVec);
  valType.setStoreTermVectorPositions(termVecPositions);
  valType.setStoreTermVectorOffsets(termVecOffsets);
  valType.freeze();

  bodyValType = new FieldType(TextField.TYPE_NOT_STORED);
  bodyValType.setStored(bodyStored);
  bodyValType.setTokenized(bodyTokenized);
  bodyValType.setOmitNorms(!bodyNorms);
  bodyValType.setStoreTermVectors(termVec);
  bodyValType.setStoreTermVectorPositions(termVecPositions);
  bodyValType.setStoreTermVectorOffsets(termVecOffsets);
  bodyValType.freeze();

  storeBytes = config.get("doc.store.body.bytes", false);

  reuseFields = config.get("doc.reuse.fields", true);

  // In a multi-rounds run, it is important to reset DocState since settings
  // of fields may change between rounds, and this is the only way to reset
  // the cache of all threads.
  docState = new ThreadLocal<DocState>();

  indexProperties = config.get("doc.index.props", false);

  updateDocIDLimit = config.get("doc.random.id.limit", -1);
  if (updateDocIDLimit != -1) {
    r = new Random(179);
  }
}

And the following is the .alg file that I set:

### START OF FILE: just an example
content.source=org.apache.lucene.benchmark.byTask.feeds.TrecContentSource
content.source.verbose=false
content.source.excludeIteration=true
doc.maker.forever=false
doc.index.props=true
content.source.log.step=2500
docs.dir=PATH_TO_MY_DATASET
doc.term.vector=true
work.dir=work
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
trec.doc.parser=org.apache.lucene.benchmark.byTask.feeds.TrecParserByPath
content.source.forever=false
content.source.encoding=UTF-8
directory=FSDirectory
doc.stored=true
doc.tokenized=true
doc.tokenized.norms=true
doc.body.tokenized.norms=true
content.source.excludeIteration=true

ResetSystemErase
CreateIndex
{ AddDoc } : *
CloseIndex
### END OF FILE
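To spell out what those doc.* properties amount to in plain Lucene 4.6 terms, this is roughly the FieldType that setConfig() above builds for my settings (a minimal sketch; the class name, field name, and value are placeholders I made up for illustration). Only fields indexed with a term-vector-enabled FieldType like this will come back from IndexReader.getTermVectors().

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

public class FieldTypeFromAlg {
  public static Document exampleDoc() {
    // Mirrors what setConfig() builds when the .alg file sets doc.stored=true,
    // doc.tokenized=true, doc.tokenized.norms=true and doc.term.vector=true.
    FieldType withVectors = new FieldType(TextField.TYPE_NOT_STORED);
    withVectors.setStored(true);            // doc.stored=true
    withVectors.setTokenized(true);         // doc.tokenized=true
    withVectors.setOmitNorms(false);        // doc.tokenized.norms=true
    withVectors.setStoreTermVectors(true);  // doc.term.vector=true
    withVectors.freeze();

    Document doc = new Document();
    doc.add(new Field("doctitle", "some title text", withVectors));  // placeholder field name and value
    return doc;
  }
}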
Regards,
Sachin Kulkarni

On Tue, Aug 19, 2014 at 1:59 PM, Sachin Kulkarni <kulk...@hawk.iit.edu> wrote:

Hi Kumaran,

I am using the benchmark utility from Lucene and doing the indexing via an .alg file.
Would you like to see the .alg file instead?

Thank you.

Regards,
Sachin

On Tue, Aug 19, 2014 at 9:42 AM, Kumaran Ramasubramanian <kums....@gmail.com> wrote:

Hi Sachin,

I want to look into your indexing code. Please share it.

-
Kumaran R

On Tue, Aug 19, 2014 at 7:18 PM, Sachin Kulkarni <kulk...@hawk.iit.edu> wrote:

Hi,

Sorry for all the code; it got sent out accidentally.

The following code is part of the Benchmark utility in Lucene, specifically SubmissionReport.java:

// Here reader is the IndexReader.

Iterator itr = docMap.entrySet().iterator();
int totalNumDocuments = reader.numDocs();
ScoreDoc sd[] = td.scoreDocs;
String sep = " \t ";
DocNameExtractor docext = new DocNameExtractor(docNameField);
for (int i = 0; i < sd.length; i++) {
  String docName = docext.docName(searcher, sd[i].doc);
  // ***** The map of documents will help us get the docid.
  int indexedDocID = docMap.get(docName);
  Fields fields = reader.getTermVectors(indexedDocID);
  Iterator<String> strItr = fields.iterator();

  // ********** The following while loop prints the field names, which only
  // show 2 fields out of the 5 that I am looking for.
  while (strItr.hasNext()) {
    String fieldName = strItr.next();
    System.out.println("next field " + fieldName);
  }

  Document DocList = reader.document(indexedDocID);
  List<IndexableField> field_list = DocList.getFields();

  // ****** The following for loop prints the five fields and their related information.
  for (int j = 0; j < field_list.size(); j++) {
    System.out.println("list field is : " + field_list.get(j).name());
    IndexableFieldType IFT = field_list.get(j).fieldType();
    System.out.println(" Field storeTermVectorOffsets : " + IFT.storeTermVectorOffsets());
    System.out.println(" Field stored :" + IFT.stored());
  }
  // ***************************** //
}

/**** THE OUTPUT for this section of code is

fields size : 2
next field body
next field docname

list field is : docid
Field storeTermVectorOffsets : false
list field is : docname
Field storeTermVectorOffsets : false
list field is : docdate
Field storeTermVectorOffsets : false
list field is : doctitle
Field storeTermVectorOffsets : false
list field is : body
Field storeTermVectorOffsets : false

*******/
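Side note: my .alg file only sets doc.term.vector=true, not doc.term.vector.offsets, so storeTermVectorOffsets() printing false does not really distinguish the fields here. A more direct check would be to ask the index itself which fields actually have term vectors, roughly like this (a sketch against the Lucene 4.6 index APIs; VectorFlagDump is a hypothetical helper, not part of SubmissionReport.java, and reader is the same IndexReader as above):

import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;

// Hypothetical helper: list every field in the index and whether term vectors
// were written for it at index time.
public class VectorFlagDump {
  public static void dump(IndexReader reader) {
    FieldInfos infos = MultiFields.getMergedFieldInfos(reader);
    for (FieldInfo fi : infos) {
      System.out.println(fi.name + "\t indexed=" + fi.isIndexed()
          + "\t termVectors=" + fi.hasVectors());
    }
  }
}

Whatever this reports for doctitle, docdate and docid should match the field names that getTermVectors() returns above.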
Hope this code comes out legible in the email.

Thank you.

Regards,
Sachin Kulkarni

On Tue, Aug 19, 2014 at 8:39 AM, Sachin Kulkarni <kulk...@hawk.iit.edu> wrote:

Hi Kumaran,

The following code is part of the Benchmark utility in Lucene, specifically SubmissionReport.java:

Iterator itr = docMap.entrySet().iterator();
int totalNumDocuments = reader.numDocs();
ScoreDoc sd[] = td.scoreDocs;
String sep = " \t ";
DocNameExtractor docext = new DocNameExtractor(docNameField);
for (int i = 0; i < sd.length; i++) {
  System.out.println("i = " + i);
  String docName = docext.docName(searcher, sd[i].doc);
  System.out.println("docName : " + docName + "\t map size " + docMap.size());
  // ***** The map will help us get the docid.
  int indexedDocID = docMap.get(docName);
  System.out.println("indexed doc id : " + indexedDocID + "\t docname : " + docName);

  // ******** GET THE tf-idf data now ************ //
  Fields fields = reader.getTermVectors(indexedDocID);
  System.out.println("fields size : " + fields.size());

  // **** Print log output for testing **** //
  Iterator<String> strItr = fields.iterator();
  while (strItr.hasNext()) {
    String fieldName = strItr.next();
    System.out.println("next field " + fieldName);
  }

  Document DocList = reader.document(indexedDocID);
  List<IndexableField> field_list = DocList.getFields();
  for (int j = 0; j < field_list.size(); j++) {
    System.out.println("list field is : " + field_list.get(j).name());
    IndexableFieldType IFT = field_list.get(j).fieldType();
    System.out.println(" Field storeTermVectorOffsets : " + IFT.storeTermVectorOffsets());
    // System.out.println(" Field stored :" + IFT.stored());
    // for (FieldInfo.IndexOptions c : IFT.indexOptions().values())
    //   System.out.println(c);
  }
  // ***************************** //
}

On Tue, Aug 19, 2014 at 2:04 AM, Kumaran Ramasubramanian <kums....@gmail.com> wrote:

Hi Sachin Kulkarni,

If possible, please share your code.

-
Kumaran R

On Tue, Aug 19, 2014 at 9:07 AM, Sachin Kulkarni <kulk...@hawk.iit.edu> wrote:

Hi,

I am using Lucene 4.6.0.

I have been storing 5 fields for my documents in the index, namely body, title, docname, docdate and docid.

But when I get the fields using IndexReader.getTermVectors(indexedDocID) I only get the docname and body fields back, and I can retrieve the term vectors for those two fields but not for the others.

I checked whether all five fields are stored using IndexableFieldType.stored() and all return true. I also checked that all the fields are indexed, and they are, but when I call getTermVectors I still only receive two fields back.

Is there any other config setting that I am missing while indexing that is causing this behavior?
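For context, what I eventually want from these vectors is the per-term frequency data, along the lines of the sketch below (Lucene 4.6 TermsEnum API; "body" is just the one field that does come back, and the helper class name is made up for illustration):

import java.io.IOException;

import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Made-up helper: print each term and its in-document frequency from the
// term vector of one field of one document.
public class TermFreqDump {
  public static void dump(IndexReader reader, int docID) throws IOException {
    Fields vectors = reader.getTermVectors(docID);
    if (vectors == null) {
      return;  // no term vectors stored for this document at all
    }
    Terms terms = vectors.terms("body");   // only works for fields indexed with term vectors
    if (terms != null) {
      TermsEnum te = terms.iterator(null); // Lucene 4.x iterator takes a reuse argument
      BytesRef term;
      while ((term = te.next()) != null) {
        System.out.println(term.utf8ToString() + "\t tf=" + te.totalTermFreq());
      }
    }
  }
}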
Thanks to Kumaran and Ian for their answers to my previous questions, but I have not been able to figure out the above one yet.

Thank you very much.

Regards,
Sachin