Re: Indexing a document that modifies itself as it's being indexed

Michael McCandless Tue, 11 Mar 2014 12:10:34 -0700

You can't rely on how IndexWriter will iterate/consume those fields;
that's an implementation detail.


Maybe you could use CachingTokenFilter to pre-process the text fields
and append the new fields?  And then during indexing, replay the
cached tokens, so you don't have to tokenize twice.
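
To make the shape of that concrete, here is a rough, library-free sketch of
the pre-process-then-replay idea. It uses only the JDK and does not call any
Lucene API. The names PreTokenizeSketch, tokenize, classify and prepare are
hypothetical stand-ins for your analyzer and classifier. In real code the
caching would presumably be done by a CachingTokenFilter wrapped around the
analyzer's TokenStream rather than by a plain list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model only: none of these names come from Lucene.
final class PreTokenizeSketch {

    // Stand-in for an analyzer: lowercases, then splits on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for a text classifier that consumes already-produced tokens.
    static String classify(List<String> tokens) {
        return tokens.contains("lucene") ? "search" : "other";
    }

    // Builds the complete set of fields BEFORE anything is handed to the
    // indexer, so the indexer never sees a document that is still changing.
    static Map<String, List<String>> prepare(String body) {
        List<String> cached = tokenize(body);   // tokenize exactly once
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("body", cached);             // replayed later, not re-tokenized
        fields.put("class", Collections.singletonList(classify(cached)));
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(prepare("Indexing with Lucene, twice as fast"));
    }
}
```

The point of the design is that the derived "class" field exists before
indexing starts, so nothing depends on the order in which IndexWriter
consumes the fields.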

Mike McCandless

http://blog.mikemccandless.com


On Tue, Mar 11, 2014 at 2:33 PM, Stephen Green <eelstretch...@gmail.com> wrote:
> I'm working on a system that uses Lucene 4.6.0, and I have a couple of use
> cases for documents that modify themselves as they're being indexed.
>
> For example, we have text classifiers that we would like to run on the
> contents of certain fields.  These classifiers produce field values (i.e.,
> the classes that the document is in) that I would like to be part of the
> document.
>
> Now, the text classifiers want to tokenize the text in order to do the
> classification, and I'd like to avoid re-tokenizing the text multiple
> times, so I can build a token filter that collects the tokens and then runs
> the classifier.  This filter can know about the
> org.apache.lucene.document.Document that's being processed, but I suspected
> that adding elements to Document.fields while it's being indexed would lead
> to a ConcurrentModificationException.
>
> Since IndexWriter.addDocument takes an Iterable<IndexableField>, I figured
> I could just make my own document class that implemented Iterable, but
> would allow me to add new fields onto the end of the document and extend
> the iteration to cover those fields.
>
> I did this, but it didn't have the effect that I was hoping for, because
> the fields that were added were never processed.
>
> Working through the code, I discovered that
> DocFieldProcessor.processDocument iterates through all the fields in the
> document, collecting them by field name (using its own hash table?) before
> processing them.
>
> Of course, this breaks my add-fields-as-other-fields-are-being-processed
> approach because the iterator is exhausted before any of the processing
> happens.
>
> So, my questions are: Does it make any sense to try to do this?  If so, is
> there an approach that will work without having to rewrite a lot of
> indexing code?
>
> Thanks,
>
> Steve Green
> --
> Stephen Green

