Re: Problems Refactoring a Lucene Index

Michael McCandless Mon, 22 Aug 2016 13:49:40 -0700

It has never worked, though I do think the metadata has changed over time,
so the degree to which it didn't work has changed?


Mike McCandless

http://blog.mikemccandless.com

On Mon, Aug 22, 2016 at 4:41 PM, Stuart Goldberg <sgoldb...@fixflyer.com>
wrote:

> Understood, but did it use to work?
>
>
>
> Stuart M Goldberg
>
> Senior Vice President of Software Development
> *FIX Flyer LLC*
> http://www.FIXFlyer.com/
>
> NOTICE TO RECIPIENT: THIS E-MAIL IS MEANT ONLY FOR THE INTENDED
> RECIPIENT(S) OF THE TRANSMISSION, AND CONTAINS CONFIDENTIAL INFORMATION
> WHICH IS PROPRIETARY TO FIX FLYER LLC. ANY UNAUTHORIZED USE, COPYING,
> DISTRIBUTION, OR DISSEMINATION IS STRICTLY PROHIBITED. ALL RIGHTS TO THIS
> INFORMATION ARE RESERVED BY FIX FLYER LLC. IF YOU ARE NOT THE INTENDED
> RECIPIENT, PLEASE CONTACT THE SENDER BY REPLY EMAIL AND PLEASE DELETE THIS
> E-MAIL FROM YOUR SYSTEM AND DESTROY ANY COPIES.
>
>
>
> *From:* Michael McCandless [mailto:luc...@mikemccandless.com]
> *Sent:* Monday, August 22, 2016 4:38 PM
> *To:* Stuart Goldberg <sgoldb...@fixflyer.com>
> *Cc:* Lucene Users <java-user@lucene.apache.org>
>
> *Subject:* Re: Problems Refactoring a Lucene Index
>
>
>
> The design is indeed trappy, and many users have hit the situation you
> have, and we have tried to fix this before (to change IndexReader.document
> to return a different class than Document), but it didn't "take":
> https://issues.apache.org/jira/browse/LUCENE-6971
>
>
>
> Have a look at FieldInfo.java to see the metadata it records.
>
>
>
> The challenge here is Lucene's schema-less-ness.  For example, on a
> document by document basis, you can change how term vectors are indexed,
> whether a field is stored, or omits norms, or indexes only docs and not
> freqs, etc., for the same field across documents, across segments.
>
>
>
> Lucene only stores in FieldInfo what is necessary for it to read the index
> files, and does not store metadata beyond that.
>
>
>
> Patches welcome :)  We should fix this trap; it's just that doing so is
> apparently not so easy.
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
>
> On Mon, Aug 22, 2016 at 11:04 AM, Stuart Goldberg <sgoldb...@fixflyer.com>
> wrote:
>
> Thanks for the quick response.
>
>
>
> I kind of figured on my own that I had to recreate the document from
> scratch.
>
>
>
> But there is something in your response that I don’t understand. You say
> “Lucene only preserves the metadata it needs for each field”. What does
> that mean? In my posting I gave examples of metadata returned that is
> clearly the exact opposite of the metadata that was there when originally
> indexed.
>
>
>
> According to what you are saying there is metadata that is preserved
> correctly. What metadata is that?
>
>
>
> Not sure if you are just a Lucene guru (I have your Lucene in Action
> books!) or an actual author/contributor to the code, so my observation
> might not be appropriately directed at you. But it seems a questionable API
> design to return a “Document” from the index that has properties described
> by the Javadoc that give back bogus data.
>
>
>
> And what about the FieldInfo class, which purports to give back field
> information? Why have such an API if the data it provides is bogus?
>
>
>
> Stuart M Goldberg
>
> Senior Vice President of Software Development
> *FIX Flyer LLC*
> http://www.FIXFlyer.com/
>
>
>
>
> *From:* Michael McCandless [mailto:luc...@mikemccandless.com]
> *Sent:* Monday, August 22, 2016 10:48 AM
> *To:* Lucene Users <java-user@lucene.apache.org>; sgoldb...@fixflyer.com
> *Subject:* Re: Problems Refactoring a Lucene Index
>
>
>
> This is unfortunately "by design": Lucene makes no guarantees that the
> Document you retrieve from an IndexReader is precisely the same Document
> you had indexed.
>
>
>
> Lucene only preserves the metadata it needs for each field.
>
>
>
> Your only recourse is to create a new Document using your application
> level information about which fields are tokenized, indexed, etc.
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
>
> On Fri, Jul 8, 2016 at 12:12 PM, Stuart Goldberg <sgoldb...@fixflyer.com>
> wrote:
>
> As our software goes through its lifecycle, we sometimes have to alter
> existing Lucene indexes. The way I have done that in the past is to open
> the existing index for reading, read each Document, modify it and write
> that Document to a new index. At the end of the process, I delete the old
> index and rename the new index to the old name.
>
> I do not do any tokenizing and use no analyzers.
>
> I recently upgraded from Lucene 3.x to 4.10.4. Now I have the following
> problem: Suppose the existing document has 10 fields in it and there's one
> I have to modify. I remove that field and re-add it with the new settings.
> Then I add the Document in its entirety to the new index. I run into the
> following problems:
>
> *       I get Exceptions thrown for the fields I don't even touch. That's
> because their FieldType has 'tokenized' set to true and it fails because I
> am using no analyzers. 'tokenized' is set to true even though when I
> originally added the field to the original index I had 'tokenized' set to
> false!
>
> *       I have LongFields that come back with 'indexed' set to false even
> though in the original index they were indexed! This makes the new index
> not searchable on these fields and hence unusable.
>
> *       I can't even alter 'indexed' for these LongFields because for some
> reason the FieldType instance comes back frozen from the IndexReader. Once
> frozen, you can't alter it. Even if I create a new FieldType, there is no
> way to change the FieldType of a Field.
>
> It seems the returned FieldType contents are kind of random!
>
> I did see in the Javadoc of IndexReader.document() that field metadata is
> not returned and that, in fact, a new kind of object, such as
> 'StoredField', should be returned so there is no pretense of there being
> any metadata.
>
> I thought perhaps I could use FieldInfos. But that class returns the same
> bogus metadata. What then is the purpose of FieldInfos if the info is
> bogus?
>
> Am I not understanding something here? This is not very usable. What can I
> do to work around this? Is this a Lucene bug? Oversight?
>
>
>
>
>
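
For readers following the thread: the recourse Mike describes (create a new
Document using application-level information about which fields are
tokenized, indexed, etc.) might be sketched roughly as below. This is only an
illustration in plain Java with no Lucene classes, so that it runs on its
own. FieldSpec, RebuiltField, rebuild and the field names "id" and
"timestamp" are all hypothetical names invented for the sketch. In a real
migration the name/value pairs would come from the stored fields of the old
index, and each FieldSpec would presumably be turned into a Lucene FieldType
when the new field is created.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaRebuild {

    /** Hypothetical per-field settings owned by the application, not read from the index. */
    static final class FieldSpec {
        final boolean indexed;
        final boolean tokenized;
        final boolean stored;

        FieldSpec(boolean indexed, boolean tokenized, boolean stored) {
            this.indexed = indexed;
            this.tokenized = tokenized;
            this.stored = stored;
        }
    }

    /** A field ready for the new index: name, value, and settings taken from the schema. */
    static final class RebuiltField {
        final String name;
        final Object value;
        final FieldSpec spec;

        RebuiltField(String name, Object value, FieldSpec spec) {
            this.name = name;
            this.value = value;
            this.spec = spec;
        }
    }

    /**
     * Rebuilds one document from stored name/value pairs only. Any metadata that
     * came back with the old document is ignored; settings come from the schema.
     * For brevity this assumes one value per field name.
     */
    static List<RebuiltField> rebuild(Map<String, Object> storedValues,
                                      Map<String, FieldSpec> schema) {
        List<RebuiltField> rebuilt = new ArrayList<>();
        for (Map.Entry<String, Object> entry : storedValues.entrySet()) {
            FieldSpec spec = schema.get(entry.getKey());
            if (spec == null) {
                // Fail loudly rather than guess settings for an unknown field.
                throw new IllegalStateException("No schema entry for field: " + entry.getKey());
            }
            rebuilt.add(new RebuiltField(entry.getKey(), entry.getValue(), spec));
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        // The application's own record of how each field is meant to be indexed.
        Map<String, FieldSpec> schema = new HashMap<>();
        schema.put("id", new FieldSpec(true, false, true));
        schema.put("timestamp", new FieldSpec(true, false, true));

        // Stand-in for the name/value pairs read back from the old index.
        Map<String, Object> storedValues = new LinkedHashMap<>();
        storedValues.put("id", "A-1");
        storedValues.put("timestamp", 1471898980000L);

        for (RebuiltField field : rebuild(storedValues, schema)) {
            System.out.println(field.name + " = " + field.value
                    + " (indexed=" + field.spec.indexed
                    + ", tokenized=" + field.spec.tokenized
                    + ", stored=" + field.spec.stored + ")");
        }
    }
}
```

The design choice is that the index is treated purely as a source of stored
values, while the schema map is the single source of truth for field
settings, so nothing depends on the FieldType that comes back from the
IndexReader.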
