Thanks, Mike. Once I was that deep in the guts of the indexer, I knew things were probably not going to go my way.
I'll check out CachingTokenFilter.

On Tue, Mar 11, 2014 at 3:09 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> You can't rely on how IndexWriter will iterate/consume those fields;
> that's an implementation detail.
>
> Maybe you could use CachingTokenFilter to pre-process the text fields
> and append the new fields? And then during indexing, replay the
> cached tokens, so you don't have to tokenize twice.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Mar 11, 2014 at 2:33 PM, Stephen Green <eelstretch...@gmail.com> wrote:
>
> > I'm working on a system that uses Lucene 4.6.0, and I have a couple of
> > use cases for documents that modify themselves as they're being indexed.
> >
> > For example, we have text classifiers that we would like to run on the
> > contents of certain fields. These classifiers produce field values (i.e.,
> > the classes that the document is in) that I would like to be part of the
> > document.
> >
> > Now, the text classifiers want to tokenize the text in order to do the
> > classification, and I'd like to avoid re-tokenizing the text multiple
> > times, so I can build a token filter that collects the tokens and then
> > runs the classifier. This filter can know about the oald.Document that's
> > being processed, but I suspected that adding elements to Document.fields
> > while it's being indexed would lead to a concurrent modification
> > exception.
> >
> > Since IndexWriter.addDocument takes an Iterable<IndexableField>, I
> > figured I could just make my own document class that implemented
> > Iterable, but would allow me to add new fields onto the end of the
> > document and extend the iteration to cover those fields.
> >
> > I did this, but it didn't have the effect that I was hoping for, because
> > the fields that were added were never processed.
> > Working through the code, I discovered that
> > DocFieldProcessor.processDocument iterates through all the fields in the
> > document, collecting them by field name (using its own hash table?)
> > before processing them.
> >
> > Of course, this breaks my add-fields-as-other-fields-are-being-processed
> > approach because the iterator is exhausted before any of the processing
> > happens.
> >
> > So, my questions are: Does it make any sense to try to do this? If so,
> > is there an approach that will work without having to rewrite a lot of
> > indexing code?
> >
> > Thanks,
> >
> > Steve Green
> >
> > --
> > Stephen Green
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Stephen Green
http://thesearchguy.wordpress.com
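[Editor's note: Mike's cache-and-replay suggestion above might be sketched roughly as follows against the Lucene 4.6 API. The `Classifier` interface, the `classifier.classify` call, and the `"body"`/`"class"` field names are invented for illustration; only `CachingTokenFilter`, `TextField(String, TokenStream)`, and `IndexWriter.addDocument` are real Lucene API. Note that in 4.6 the underlying stream must be reset before the cache is filled, since `CachingTokenFilter.reset()` does not forward to its input.]

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ClassifyThenIndex {

  /** Hypothetical classifier -- stands in for whatever produces the doc classes. */
  interface Classifier {
    String classify(List<String> tokens);
  }

  static void addClassified(IndexWriter writer, Analyzer analyzer,
                            Classifier classifier, String body) throws IOException {
    TokenStream stream = analyzer.tokenStream("body", body);
    stream.reset();  // 4.6: reset the wrapped stream before the cache fills
    CachingTokenFilter cached = new CachingTokenFilter(stream);

    // First pass: consume the tokens once, caching them as a side effect,
    // and feed them to the classifier.
    CharTermAttribute term = cached.addAttribute(CharTermAttribute.class);
    List<String> tokens = new ArrayList<String>();
    while (cached.incrementToken()) {
      tokens.add(term.toString());
    }
    String docClass = classifier.classify(tokens);

    // Second pass: hand the cached stream to the document. When the indexer
    // calls reset(), CachingTokenFilter replays the cached tokens, so the
    // text is analyzed only once and the classifier's output is just
    // another field added before addDocument is called.
    Document doc = new Document();
    doc.add(new TextField("body", cached));
    doc.add(new StringField("class", docClass, Field.Store.YES));
    writer.addDocument(doc);
    stream.close();
  }
}
```

Because the classification happens entirely before `addDocument`, this sidesteps the problem above: `DocFieldProcessor` sees a plain, fully-populated document, and nothing mutates the field list mid-iteration.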