Hi, perhaps first some background:
I need to speed up indexing for a particular application which has a pretty unusual schema: besides the normal stored and indexed fields, we have about 20,000 fields per document which are all indexed, non-stored sInts. Obviously, indexing was really slow with such a number of fields. Indexing through Solr we got about 0.3 docs/sec (on an EC2 m1.large instance).

Since these ~20,000 fields are all built/calculated analogously, we figured it would be possible to build a low-level indexer for just these fields (we have domain knowledge about them that we could use to speed indexing up) and later merge them with the other fields to construct the entire index. So we did, and we now achieve around 1.8 docs/sec (a 6x speedup). Not bad, but still not enough.

As part of calculating these fields, we keep track of all fields, the terms per field, and the docids per term (per field). All of this is then ordered (the fields, the terms available for each field, the docids per term) and inserted in that order using low-level classes like FormatPostingsFieldsWriter, FormatPostingsTermsConsumer and FormatPostingsDocsConsumer (pseudo-code below). This constructs the following files: .tis, .tii, .frq (plus some default values for the other required files, which don't need actual data because these fields are not stored).

I should also mention that each call to the indexer writes all available fields, terms and docids to a new FSDirectory. So basically each call results in a complete index (containing about 100 docs each, because otherwise we run into memory problems keeping the ordered maps).

Since we already have all fields, terms and docids in order, it seems like (a lot of) overkill to me to go through the methods that the above-mentioned classes offer, which were meant for more 'non-sequential / non-ordered' inserts (AFAIK). What would be the best way to write .tis, .tii and .frq in a more sequential manner?
I'm looking for something that would construct a byte array for each file that conforms to the index-file definition of that particular file (or something similar). I could try to do it myself, bypass all indexing classes altogether and just write the files to disk. (This is possible because, as mentioned, we have all the data needed to construct a complete index.) However, perhaps there are classes I'm not aware of that help in getting the format right (it seems like a lot of trial-and-error coding otherwise).

Thanks for any help, pointers, etc.

Geert-Jan

PSEUDO-CODE of the current low-level (not-so-)sequential indexer:

    foreach (String sField : fieldsInOrder) {
        --> add field to FieldInfos and grab the newly created fieldInfo
        --> add fieldInfo to FormatPostingsFieldsWriter and grab formatPostingsTermsConsumer

        List<String> termsInOrder = termsInOrderForFieldMap(sField);
        foreach (String sTerm : termsInOrder) {
            FormatPostingsDocsConsumer frq = formatPostingsTermsConsumer.add(sTerm);
            List<Integer> docidsInOrderPerFieldTerm = docidsInOrderPerFieldTermMap(sField + "-" + sTerm);
            for (Integer docid : docidsInOrderPerFieldTerm) {
                frq.addDoc(docid);
            }
            // close relevant stuff
        }
        // close relevant stuff
    }
    // close relevant stuff
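
To make the "construct a byte array that conforms to the file definition" idea concrete, here is a minimal sketch of how the postings of a single term might be laid out for the .frq file. It rests on several assumptions that would need checking against the file-format documentation of the Lucene version in use: the pre-4.0 layout (VInt-encoded docid deltas, with the low bit of each value flagging a term frequency of 1), term frequencies stored (no omitTf), a frequency of exactly 1 in every document, and no skip data. The class FrqSketch and its methods are hypothetical names, not Lucene API, and the matching .tis/.tii entries (which point into .frq) are not covered.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: a sketch of the byte layout only.
public class FrqSketch {

    // Writes a non-negative int as a VInt: 7 data bits per byte, low-order
    // bits first, high bit set on every byte except the last one.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes the ascending docids of one term.
    // Assumption: every document has term frequency 1, so each entry is
    // (docDelta << 1) | 1 and no separate frequency VInt follows.
    static byte[] encodePostings(List<Integer> docIdsInOrder) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDocId = 0;
        for (int docId : docIdsInOrder) {
            int delta = docId - lastDocId; // docids must be ascending
            writeVInt(out, (delta << 1) | 1);
            lastDocId = docId;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodePostings(Arrays.asList(3, 7, 200));
        System.out.println(Arrays.toString(bytes)); // prints [7, 9, -125, 3]
    }
}
```

Because the docids per term are already sorted, the encoding is a single forward pass with no buffering beyond the output stream, which is the property the ordered maps are meant to exploit.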