Upgrade Path Lucene 3.0.2 to 3.4

Paul Allan Hill Wed, 16 Nov 2011 13:55:57 -0800

As it says in the title, we are moving from 3.0.2 to 3.4.  I am interested 
in whether we need to build a new index or can keep updating the current 
one.   My company has been busy building software and has not upgraded the 
Lucene and Tika libraries since last year, but I'm trying to remedy that as 
quickly as I can.   We have production indices with 5,000,000 to 1,000,000 
English language documents.  These are business documents (the usual MS Word, 
PDF ...) with only the very occasional phrase in other character sets (for 
example, a Japanese or Chinese company name inserted in an otherwise English 
document).


So here are my high-level questions when doing such an upgrade jump:

1.       Do we need to start from scratch and create a new index or can I 
re-crawl documents into the existing index?
My impression is that, if we were using 2.x the answer would definitely be that 
a rebuild is required, but the answer doesn't jump out at me in releases since 
then. I think the answer seems to be no.

2.       If we don't HAVE TO RE-CREATE the index, are there advantages to doing 
this?

a.       Should I be looking into eventually leveraging 
org.apache.lucene.index.IndexUpgrader (see 
LUCENE-3082<http://issues.apache.org/jira/browse/LUCENE-3082>)?

In our application there is one Lucene "service" running in this system and it 
will be running the latest code, so there are no issues of old code needing to 
access the index.

Because of the improvements over the last year in Tika, we will set our system 
to re-crawl all documents, so I believe this eliminates various issues 
involving tokenizing fixes.
We have tests which demonstrate that the new Lucene libraries, when used to 
index and then search, return the same (or improved) results.  We also have 
tests which verify that Tika's parsing has improved (three cheers to the Tika 
folks for parsing half the previously failing PDFs and 40% of the old MS 
Word-95 docs).  Hats off to the folks involved in both - great job on both bug 
fixes and the new features!

But my question is about (1) updating libraries, but (2) using an existing 
index that will have all documents (eventually) replaced. Given my scenario, 
what are my issues, if any?  I attempt to answer my own question below and I 
think the answer is that I don't need to create a new clean index.
I would be interested in any feedback.

-Paul
p.s. If I had one suggestion, I would suggest that in the release note summary 
of a bug, it would be better form to eliminate any shorthand acronyms (or just 
throw in a link to either an appropriate description or even the JavaDoc).  
Obviously, in the bug discussion there will be all kinds of terse usage, but 
one liners in release notes are read by folks a little less informed about some 
of the parts of Lucene.

*********** Detailed Review Follows *******

Reviewing the releases at http://lucene.apache.org/java/docs/index.html

The Java 7 JVM optimization bug has been fixed.  This is great; we were aware 
of this, so never used Java 7.

The Unicode changes across JVMs referenced in the Java 7 and other JVM upgrade 
notes are interesting.
See for example the copy at:
https://github.com/apache/lucene-solr/blob/trunk/lucene/JRE_VERSION_MIGRATION.txt

In my case, we will be running the code under Java 7 while re-indexing, so I 
think all will be properly upgraded.
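
A minimal sketch of why the JRE version matters here, using only the standard 
library and no Lucene code. It assumes the probe code point U+20B9 (INDIAN 
RUPEE SIGN), which was added in Unicode 6.0; Java 6 follows Unicode 4.0 and 
Java 7 follows Unicode 6.0, so the same call gives a different answer depending 
on the JRE. The class and method names are hypothetical.

```java
/** Hypothetical probe, not part of Lucene: shows that character data depends on the JRE. */
final class UnicodeProbe {
    // Assumed probe character: U+20B9 was introduced in Unicode 6.0.
    static final int INDIAN_RUPEE_SIGN = 0x20B9;

    /** Returns true if the running JRE's Unicode tables define the code point. */
    static boolean definedOnThisJre(int codePoint) {
        return Character.isDefined(codePoint);
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        // Expected to be false on a Java 6 JRE and true on Java 7 or later.
        System.out.println("U+20B9 defined: " + definedOnThisJre(INDIAN_RUPEE_SIGN));
    }
}
```

Text analyzed under one JRE can therefore be classified differently under 
another, which is the reason a full re-index under the new JRE is the safe path.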

Reviewing the 3.4 bugs, there seem to be only a few that relate to the files in 
the index on disk:
LUCENE-3409<http://issues.apache.org/jira/browse/LUCENE-3409>: 
IndexWriter.deleteAll was [....], leading to unused files accumulating in the 
Directory.
My Comment: Curiously the details for this bug describe a memory leak, not a 
problem with files on disk, but anyway we aren't using Near Real-Time Readers 
(yet) and only use deleteAll when testing in test indexes.

LUCENE-3358<http://issues.apache.org/jira/browse/LUCENE-3358>, 
LUCENE-3361<http://issues.apache.org/jira/browse/LUCENE-3361>: 
StandardTokenizer and UAX29URLEmailTokenizer wrongly [...in ...] Han or 
Hiragana characters...
My Comment: This (if even relevant to us) would be fixed by re-indexing which 
we will be doing anyway.
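
To check whether this is "even relevant", one could scan the extracted text for 
Han or Hiragana characters before indexing. The following is a minimal sketch 
using only the Java 7 standard library; the class and method names are 
hypothetical and it is not part of Lucene or Tika.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical helper: detects the scripts named in LUCENE-3358 / LUCENE-3361. */
final class CjkScan {
    /** Returns true if the text contains at least one Han or Hiragana code point. */
    static boolean containsHanOrHiragana(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN || script == UnicodeScript.HIRAGANA) {
                return true;
            }
            // Advance by one or two chars depending on whether cp is supplementary.
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // Plain English text: prints false.
        System.out.println(containsHanOrHiragana("Quarterly report for Acme Corp."));
        // English text with an embedded company name written in Han characters: prints true.
        System.out.println(containsHanOrHiragana("Contract with \u682A\u5F0F\u4F1A\u793E Acme"));
    }
}
```

Counting how many crawled documents return true would give a rough idea of how 
much of the corpus the tokenizer fix can touch.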

LUCENE-3368<http://issues.apache.org/jira/browse/LUCENE-3368> IndexWriter 
applies wrong deletes during concurrent flush-all
My Comment: Only occurs when there are two writers, which we don't have.  I 
thought only one writer was allowed, so I'm really not grokking this bug. Can 
anyone explain this one to me?

LUCENE-3365<http://issues.apache.org/jira/browse/LUCENE-3365>: ... can cause 
IndexWriter overriding an existing index.
My Comment: I think we would have known about this one if it did occur in our 
system, but it is now fixed.

LUCENE-3418<http://issues.apache.org/jira/browse/LUCENE-3418>: Lucene was 
failing to fsync index files on commit, meaning an operating system or hardware 
crash, or power loss, could easily corrupt the index.
My Comment:  This is the issue mentioned in the release announcement.  Luckily 
for us, even though we've had production environments crash during a power 
outage, we didn't see this.
Reading the notes on this, it seems this was a hard fail that was obvious when 
it occurred.
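
For readers less familiar with the term, "fsync" means asking the operating 
system to push buffered writes to the storage device before continuing. The 
sketch below shows the idea with plain java.nio; it is an illustration only, 
not Lucene's implementation, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration of an fsync-style durable write (not Lucene code). */
final class FsyncSketch {
    /** Writes data to file, forces it to the device, and returns the file size. */
    static long writeDurably(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Without this call the bytes may still sit in OS buffers when power is lost.
            ch.force(true);
            return ch.size();
        }
    }

    /** Runs the write against a temporary file and returns the number of bytes stored. */
    static long demo() {
        try {
            Path tmp = Files.createTempFile("fsync-sketch", ".bin");
            try {
                return writeDurably(tmp, "segment data".getBytes(StandardCharsets.UTF_8));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes forced to disk: " + demo());
    }
}
```

A commit that skips the force step can appear to succeed and still lose data in 
a crash, which matches the failure described in the release note.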

Reviewing the 3.3 release:
There appear to be no bugs which affected the files on disk that are not fixed 
by re-indexing.

Reviewing the 3.2.0 release:
LUCENE-3065<http://issues.apache.org/jira/browse/LUCENE-3065>:  In API changes 
it says, Document.getField() was deprecated. In changes in runtime behavior it 
says "... Document.getFieldable() returns NumericField instances".
My Comment:  We have more than one numeric field in our index, so we have moved 
to using Document.getFieldable(); we're doing this the right way.

Reviewing the 3.1.0 release:
There appear to be no bugs which affected the files on disk that are not fixed 
by re-indexing documents (for example 
LUCENE-2911<http://issues.apache.org/jira/browse/LUCENE-2911>).

Reviewing the 3.0.3 release:
There appear to be no bugs which affected the files on disk that are not fixed 
by re-indexing documents.

That doesn't seem bad at all!
Comments?
