Hi Kumaran,

Below is part of the code, along with the .alg file.

Here is the setConfig function from DocMaker.java, in the package org.apache.lucene.benchmark.byTask.feeds:

/** Set the configuration parameters of this doc maker. */
public void setConfig(Config config, ContentSource source) {
  this.config = config;
  this.source = source;

  boolean stored = config.get("doc.stored", false);
  boolean bodyStored = config.get("doc.body.stored", stored);
  boolean tokenized = config.get("doc.tokenized", true);
  boolean bodyTokenized = config.get("doc.body.tokenized", tokenized);
  boolean norms = config.get("doc.tokenized.norms", false);
  boolean bodyNorms = config.get("doc.body.tokenized.norms", true);
  boolean termVec = config.get("doc.term.vector", false);
  boolean termVecPositions = config.get("doc.term.vector.positions", false);
  boolean termVecOffsets = config.get("doc.term.vector.offsets", false);

  valType = new FieldType(TextField.TYPE_NOT_STORED);
  valType.setStored(stored);
  valType.setTokenized(tokenized);
  valType.setOmitNorms(!norms);
  valType.setStoreTermVectors(termVec);
  valType.setStoreTermVectorPositions(termVecPositions);
  valType.setStoreTermVectorOffsets(termVecOffsets);
  valType.freeze();

  bodyValType = new FieldType(TextField.TYPE_NOT_STORED);
  bodyValType.setStored(bodyStored);
  bodyValType.setTokenized(bodyTokenized);
  bodyValType.setOmitNorms(!bodyNorms);
  bodyValType.setStoreTermVectors(termVec);
  bodyValType.setStoreTermVectorPositions(termVecPositions);
  bodyValType.setStoreTermVectorOffsets(termVecOffsets);
  bodyValType.freeze();

  storeBytes = config.get("doc.store.body.bytes", false);

  reuseFields = config.get("doc.reuse.fields", true);

  // In a multi-rounds run, it is important to reset DocState since settings
  // of fields may change between rounds, and this is the only way to reset
  // the cache of all threads.
  docState = new ThreadLocal<DocState>();

  indexProperties = config.get("doc.index.props", false);

  updateDocIDLimit = config.get("doc.random.id.limit", -1);
  if (updateDocIDLimit != -1) {
    r = new Random(179);
  }
}

And the following is the .alg file that I set:

### START OF FILE: just an example
content.source=org.apache.lucene.benchmark.byTask.feeds.TrecContentSource
content.source.verbose=false
content.source.excludeIteration=true
doc.maker.forever=false
doc.index.props=true
content.source.log.step=2500
docs.dir=PATH_TO_MY_DATASET
doc.term.vector=true
work.dir=work
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
trec.doc.parser=org.apache.lucene.benchmark.byTask.feeds.TrecParserByPath
content.source.forever=false
content.source.encoding=UTF-8
directory=FSDirectory
doc.stored=true
doc.tokenized=true
doc.tokenized.norms=true
doc.body.tokenized.norms=true
content.source.excludeIteration=true

ResetSystemErase
CreateIndex
{ AddDoc } : *
CloseIndex
### END OF FILE

Regards,
Sachin Kulkarni

On Tue, Aug 19, 2014 at 1:59 PM, Sachin Kulkarni <kulk...@hawk.iit.edu> wrote:

> Hi Kumaran,
>
> I am using the benchmark utility from Lucene and doing the indexing via an
> .alg file.
> Would you like to see the .alg file instead?
>
> Thank you.
>
> Regards,
> Sachin
>
>
> On Tue, Aug 19, 2014 at 9:42 AM, Kumaran Ramasubramanian <kums....@gmail.com> wrote:
>
>> Hi Sachin,
>>
>> I want to look into your indexing code. Please share it.
>>
>> -
>> Kumaran R
>>
>>
>> On Tue, Aug 19, 2014 at 7:18 PM, Sachin Kulkarni <kulk...@hawk.iit.edu> wrote:
>>
>> > Hi,
>> >
>> > Sorry for all the code, it got sent out accidentally.
>> >
>> > The following code is part of the Benchmark utility in Lucene,
>> > specifically SubmissionReport.java
>> >
>> >
>> > // Here reader is the IndexReader.
>> >
>> >
>> > Iterator itr = docMap.entrySet().iterator();
>> > int totalNumDocuments = reader.numDocs();
>> > ScoreDoc sd[] = td.scoreDocs;
>> > String sep = " \t ";
>> > DocNameExtractor docext = new DocNameExtractor(docNameField);
>> > for (int i=0; i<sd.length; i++)
>> > {
>> >   String docName = docext.docName(searcher,sd[i].doc);
>> >   // ***** The Map of documents will help us get the docid
>> >   int indexedDocID = docMap.get(docName);
>> >   Fields fields = reader.getTermVectors(indexedDocID);
>> >   Iterator<String> strItr=fields.iterator();
>> >
>> >   /// ********** The following while loop prints the field names, which only
>> >   /// show 2 fields out of the 5 that I am looking for.
>> >   while(strItr.hasNext())
>> >   {
>> >     String fieldName = strItr.next();
>> >     System.out.println("next field " + fieldName);
>> >   }
>> >   Document DocList= reader.document(indexedDocID);
>> >   List<IndexableField> field_list = DocList.getFields();
>> >
>> >   /// ****** The following for loop prints the five fields and
>> >   /// their related information.
>> >   for(int j=0; j < field_list.size(); j++)
>> >   {
>> >     System.out.println ( "list field is : " + field_list.get(j).name() );
>> >     IndexableFieldType IFT = field_list.get(j).fieldType();
>> >     System.out.println(" Field storeTermVectorOffsets : " +
>> >         IFT.storeTermVectorOffsets());
>> >     System.out.println(" Field stored :" + IFT.stored());
>> >   }
>> >   // ***************************** //
>> > }
>> >
>> >
>> > /**** THE OUTPUT for this section of code is
>> > fields size : 2
>> > next field body
>> > next field docname
>> >
>> > list field is : docid
>> > Field storeTermVectorOffsets : false
>> > list field is : docname
>> > Field storeTermVectorOffsets : false
>> > list field is : docdate
>> > Field storeTermVectorOffsets : false
>> > list field is : doctitle
>> > Field storeTermVectorOffsets : false
>> > list field is : body
>> > Field storeTermVectorOffsets : false
>> >
>> > *******/
>> >
>> > Hope this code comes out legible in the email.
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Sachin Kulkarni
>> >
>> >
>> > On Tue, Aug 19, 2014 at 8:39 AM, Sachin Kulkarni <kulk...@hawk.iit.edu> wrote:
>> >
>> > > Hi Kumaran,
>> > >
>> > > The following code is part of the Benchmark utility in Lucene,
>> > > specifically SubmissionReport.java
>> > >
>> > >
>> > > Iterator itr = docMap.entrySet().iterator();
>> > > int totalNumDocuments = reader.numDocs();
>> > > ScoreDoc sd[] = td.scoreDocs;
>> > > String sep = " \t ";
>> > > DocNameExtractor docext = new DocNameExtractor(docNameField);
>> > > for (int i=0; i<sd.length; i++)
>> > > {
>> > >   System.out.println("i = " + i);
>> > >   String docName = docext.docName(searcher,sd[i].doc);
>> > >   System.out.println("docName : " + docName + "\t map size " +
>> > >       docMap.size());
>> > >   // ***** The Map will help us get the docid and
>> > >   int indexedDocID = docMap.get(docName);
>> > >   System.out.println("indexed doc id : " + indexedDocID + "\t docname : "
>> > >       + docName);
>> > >   // ******** GET THE tf-idf data now ************ //
>> > >   Fields fields = reader.getTermVectors(indexedDocID);
>> > >   System.out.println("fields size : " + fields.size());
>> > >   // **** Print log output for testing **** //
>> > >   Iterator<String> strItr=fields.iterator();
>> > >   while(strItr.hasNext())
>> > >   {
>> > >     String fieldName = strItr.next();
>> > >     System.out.println("next field " + fieldName);
>> > >   }
>> > >   Document DocList= reader.document(indexedDocID);
>> > >   List<IndexableField> field_list = DocList.getFields();
>> > >   for(int j=0; j < field_list.size(); j++)
>> > >   {
>> > >     System.out.println ( "list field is : " + field_list.get(j).name() );
>> > >     IndexableFieldType IFT = field_list.get(j).fieldType();
>> > >     System.out.println(" Field storeTermVectorOffsets : " +
>> > >         IFT.storeTermVectorOffsets());
>> > >     //System.out.println(" Field stored :" + IFT.stored());
>> > >     //for (FieldInfo.IndexOptions c : IFT.indexOptions().values())
>> > >     // System.out.println(c);
>> > >   }
>> > >   // ***************************** //
>> > >
>> > >
>> > > On Tue, Aug 19, 2014 at 2:04 AM, Kumaran Ramasubramanian <kums....@gmail.com> wrote:
>> > >
>> > >> Hi Sachin Kulkarni,
>> > >>
>> > >> If possible, please share your code.
>> > >>
>> > >> -
>> > >> Kumaran R
>> > >>
>> > >>
>> > >> On Tue, Aug 19, 2014 at 9:07 AM, Sachin Kulkarni <kulk...@hawk.iit.edu> wrote:
>> > >>
>> > >> > Hi,
>> > >> >
>> > >> > I am using Lucene 4.6.0.
>> > >> >
>> > >> > I have been storing 5 fields for my documents in the index, namely
>> > >> > body, title, docname, docdate and docid.
>> > >> >
>> > >> > But when I get the fields using
>> > >> > IndexReader.getTermVectors(indexedDocID) I only get the docname and
>> > >> > body fields. I can retrieve the term vectors for those fields, but
>> > >> > not for the others.
>> > >> >
>> > >> > I check whether all five fields are stored using
>> > >> > IndexableFieldType.stored(), and all return true. I also check that
>> > >> > all the fields are indexed, and they are, but when I call
>> > >> > getTermVectors I still receive only two fields back.
>> > >> >
>> > >> > Is there any other config setting that I am missing while indexing
>> > >> > that is causing this behavior?
>> > >> >
>> > >> > Thanks to Kumaran and Ian for their answers to my previous questions,
>> > >> > but I have not been able to figure out the above one yet.
>> > >> >
>> > >> > Thank you very much.
>> > >> >
>> > >> > Regards,
>> > >> > Sachin
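
P.S. For reference, here is a minimal sketch of how I read the defaults in the setConfig function quoted at the top of this message, applied to the doc.* properties from my .alg file. It uses only java.util.Properties. The class AlgFieldSettings and its get/resolve/parse helpers are hypothetical names made up for illustration, not part of Lucene, and the lookup is a simplified stand-in for Config.get: it assumes a missing key simply falls back to the supplied default and ignores anything else Config may do.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper (not part of Lucene): mirrors the boolean lookups in setConfig. */
final class AlgFieldSettings {

    /** Returns the boolean value for key, or dflt when the key is absent (assumed Config.get behaviour). */
    static boolean get(Properties props, String key, boolean dflt) {
        String value = props.getProperty(key);
        return value == null ? dflt : Boolean.parseBoolean(value.trim());
    }

    /** Resolves the flags that setConfig applies to valType and bodyValType. */
    static Map<String, Boolean> resolve(Properties props) {
        boolean stored = get(props, "doc.stored", false);
        boolean tokenized = get(props, "doc.tokenized", true);
        Map<String, Boolean> out = new LinkedHashMap<>();
        out.put("stored", stored);
        // The body settings fall back to the general settings when not set explicitly.
        out.put("body.stored", get(props, "doc.body.stored", stored));
        out.put("tokenized", tokenized);
        out.put("body.tokenized", get(props, "doc.body.tokenized", tokenized));
        out.put("norms", get(props, "doc.tokenized.norms", false));
        out.put("body.norms", get(props, "doc.body.tokenized.norms", true));
        // The term vector flags are shared by valType and bodyValType.
        out.put("termVectors", get(props, "doc.term.vector", false));
        out.put("termVectorPositions", get(props, "doc.term.vector.positions", false));
        out.put("termVectorOffsets", get(props, "doc.term.vector.offsets", false));
        return out;
    }

    /** Parses key=value text, one property per line. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Only the doc.* properties from the .alg file above.
        String alg = String.join("\n",
            "doc.index.props=true",
            "doc.term.vector=true",
            "doc.stored=true",
            "doc.tokenized=true",
            "doc.tokenized.norms=true",
            "doc.body.tokenized.norms=true");
        resolve(parse(alg)).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

If this reading is right, termVectors resolves to true while termVectorPositions and termVectorOffsets stay false, because the .alg file sets doc.term.vector but neither doc.term.vector.positions nor doc.term.vector.offsets. It does not by itself explain why getTermVectors returns only two fields.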