Thanks, Mike.

Once I was that deep in the guts of the indexer, I knew things were
probably not going to go my way.

I'll check out CachingTokenFilter.
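For the archives, here's roughly the shape I have in mind (untested sketch against the 4.6 API; `MyClassifier` and the field names are made up): consume the caching filter once to feed the classifier, rewind it, then hand it to a TokenStream-based TextField so indexing replays the cached tokens instead of tokenizing again.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ClassifyingIndexer {

  // Hypothetical classifier interface, just for the sketch.
  interface MyClassifier {
    void observe(String token);
    String bestClass();
  }

  void indexWithClassification(IndexWriter writer, Analyzer analyzer,
                               String text, MyClassifier classifier)
      throws IOException {
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    CachingTokenFilter cached = new CachingTokenFilter(ts);
    CharTermAttribute term = cached.addAttribute(CharTermAttribute.class);

    // First pass: consuming the caching filter fills its cache and feeds
    // the classifier.  (In 4.x the underlying stream must be reset first;
    // CachingTokenFilter.reset() only rewinds an already-filled cache.)
    ts.reset();
    while (cached.incrementToken()) {
      classifier.observe(term.toString());
    }

    // Rewind the cache so the same tokens can be replayed at index time.
    cached.reset();

    Document doc = new Document();
    // TextField(String, TokenStream) replays the cached tokens; the text
    // is never tokenized a second time.
    doc.add(new TextField("body", cached));
    // The classifier's output is just another field, added before
    // addDocument, so no mid-indexing mutation is needed.
    doc.add(new StringField("class", classifier.bestClass(), Field.Store.YES));
    writer.addDocument(doc);
  }
}
```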



On Tue, Mar 11, 2014 at 3:09 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> You can't rely on how IndexWriter will iterate/consume those fields;
> that's an implementation detail.
>
> Maybe you could use CachingTokenFilter to pre-process the text fields
> and append the new fields?  And then during indexing, replay the
> cached tokens, so you don't have to tokenize twice.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Mar 11, 2014 at 2:33 PM, Stephen Green <eelstretch...@gmail.com> wrote:
> > I'm working on a system that uses Lucene 4.6.0, and I have a couple of
> > use cases for documents that modify themselves as they're being indexed.
> >
> > For example, we have text classifiers that we would like to run on the
> > contents of certain fields.  These classifiers produce field values
> > (i.e., the classes that the document is in) that I would like to be
> > part of the document.
> >
> > Now, the text classifiers want to tokenize the text in order to do the
> > classification, and I'd like to avoid re-tokenizing the text multiple
> > times, so I can build a token filter that collects the tokens and then
> > runs the classifier.  This filter can know about the oald.Document
> > that's being processed, but I suspected that adding elements to
> > Document.fields while it's being indexed would lead to a concurrent
> > modification exception.
> >
> > Since IndexWriter.addDocument takes an Iterable<IndexableField>, I
> > figured I could just make my own document class that implemented
> > Iterable, but would allow me to add new fields onto the end of the
> > document and extend the iteration to cover those fields.
> >
> > I did this, but it didn't have the effect that I was hoping for, because
> > the fields that were added were never processed.
> >
> > Working through the code, I discovered that
> > DocFieldProcessor.processDocument iterates through all the fields in
> > the document, collecting them by field name (using its own hash
> > table?) before processing them.
> >
> > Of course, this breaks my add-fields-as-other-fields-are-being-processed
> > approach because the iterator is exhausted before any of the processing
> > happens.
> >
> > So, my questions are: Does it make any sense to try to do this?  If
> > so, is there an approach that will work without having to rewrite a
> > lot of indexing code?
> >
> > Thanks,
> >
> > Steve Green
> > --
> > Stephen Green
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Stephen Green
http://thesearchguy.wordpress.com
