RE: Lucene 3.0.0 writer with a Lucene 2.3.1 index

Rob Staveley (Tom) Sat, 12 Dec 2009 02:44:37 -0800

Thanks for picking up on this Anshum and Uwe.

I used the following approach to convert my 2.3 index (which, yes, was
optimised already) to 3.0...


   Using 3.0 Lucene, I created a new empty index with my IndexWriter. I
opened my 2.3 index with an IndexReader. I added the 2.3 index with
writer.addIndexes(reader) and then optimized and committed. I assume that
counts as 2 segments being optimized, despite the fact that my new segment
would have been empty. Of my 3 indexes I noticed a small growth in the index
which has no compressed fields and a very small shrink in the two indexes
which did have compressed fields.

So, it looks like it wasn't a no-op, but looks like I was compressing <1K
fields, as Uwe suspected. Typically these were synopsis fields with
3-sentence extracts from the texts being indexed. I hadn't realised that the
threshold was as high as 1K to pay dividends. I would have been better off
not compressing those fields.
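
To illustrate why short fields gain nothing, here is a minimal sketch using
only the JDK. It assumes the compressed fields were deflated with
java.util.zip.Deflater, which is my understanding of how Lucene 2.x handled
Field.Store.COMPRESS. The class name FieldCompressionDemo and the sample
strings are hypothetical, and it measures raw Deflate output only, not
Lucene's on-disk stored-field format, so it will not reproduce the byte
counts from Uwe's test.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical demo class; not part of Lucene.
final class FieldCompressionDemo {

    /** Size in bytes of the UTF-8 encoding of the text. */
    static int utf8Size(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes Deflater emits for the UTF-8 encoding of the text. */
    static int deflatedSize(String text) {
        byte[] input = text.getBytes(StandardCharsets.UTF_8);
        // Assumption: highest compression level, as a stand-in for stored-field compression.
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[256];
        int total = 0;
        // Drain the compressor; only the byte count matters here.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // A short synopsis-like value with almost nothing for Deflate to reuse.
        String shortField = "Brief synopsis of one document.";

        // A longer, repetitive value where back-references can pay off.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("Lucene stores field text. ");
        }
        String longField = sb.toString();

        System.out.println("short: " + utf8Size(shortField)
                + " bytes raw, " + deflatedSize(shortField) + " bytes deflated");
        System.out.println("long:  " + utf8Size(longField)
                + " bytes raw, " + deflatedSize(longField) + " bytes deflated");
    }
}
```

For the short value the deflated size comes out larger than the raw size,
because the zlib header and checksum outweigh any savings. The long repetitive
value shrinks considerably.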

It looks like I'll benefit from Lucene 3 stopping me from abusing
compression! 8-)

Many thanks!

-----Original Message-----
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: 11 December 2009 18:43
To: java-user@lucene.apache.org
Subject: RE: Lucene 3.0.0 writer with a Lucene 2.3.1 index

The index *should* grow after merging/optimizing, but it will only do so if
the fields you had compressed were actually smaller than without compression.
One of the tests showed: a string field with 80 ASCII chars needed about
250 bytes when compressed, which is 3 times the uncompressed size (as chars
are UTF-8 encoded). So it was always a bad idea to compress only short
fields; compression for, say, fields <1024 chars is simply a waste of time
and disk space.

So maybe you hit this issue: some fields were so small that the compressed
representation was larger than the uncompressed one, and for others it was
the other way round. This leads to roughly zero net change.

By the way, if your index was already optimized in 2.3 and you try to
optimize it in 3.0, it will be a no-op, as optimization needs at least two
segments to merge.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Anshum [mailto:ansh...@gmail.com]
> Sent: Friday, December 11, 2009 7:31 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 3.0.0 writer with a Lucene 2.3.1 index
> 
> Hi Tom,
> Pt 3: As per my knowledge, it wouldn't be a 'mixture' of 2 index types.
> Rather, as soon as you optimize (or do an IndexWriter operation on the
> current index), it would expand the index to a non-compressed format. I
> read somewhere in the release notes that on doing so, a growth in the
> index size should be anticipated and handled.
> 
> --
> Anshum Gupta
> Naukri Labs!
> http://ai-cafe.blogspot.com
> 
> The facts expressed here belong to everybody, the opinions to me. The
> distinction is yours to draw............
> 
> 
> On Fri, Dec 11, 2009 at 10:50 PM, Rob Staveley (Tom)
> <rstave...@seseit.com> wrote:
> 
> > I'm upgrading from 2.3.1 to 3.0.0. I have 3.0.0 index readers ready to go
> > into production and writers in the process of upgrading to 3.0.0.
> >
> > I think I understand the implications of
> > http://wiki.apache.org/lucene-java/BackwardsCompatibility#File_Formats for
> > the upgrade, but I'd love it if someone could validate my following
> > assumptions.
> >
> >  1. My 2.3.1 indexes have compressed fields in them, which the 3.0.0
> > readers work nicely with, as expected. I should assume that my 3.0.0
> > readers will continue to handle 2.3.1 indexes OK.
> >
> >  2. Presumably all future Lucene 3.x index readers will continue to handle
> > compressed fields and we should only anticipate Lucene 4.x choking on them.
> >
> > I was naively expecting my index directories to grow when my 3.0.0 index
> > writer merged the 2.3.1 indexes and/or optimize()'d them, converting them
> > to 3.0.0. However, I don't see that. Presumably that means that....
> >
> >  3. Documents added to existing 2.3.1 indexes will be added conforming to
> > 3.0.0, but existing documents in the index will continue to have compressed
> > content and old documents can coexist happily with the new ones, and my
> > indexes will become a mixture of 2.3.1 and 3.0.0.
> >
> >  4. I should use
> >
> > http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/util/Version.html#LUCENE_23
> >
> > for the StandardAnalyzer and QueryParser in mixed indexes in 3.0.0 if I
> > want to handle analysis consistently, or go for LUCENE_CURRENT if I want
> > to handle the new content "better" (bearing in mind that the new content
> > will eventually replace the old content anyhow).
> >
> >  5. I should use
> >
> > http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/analysis/StopFilter.html#StopFilter%28boolean,%20org.apache.lucene.analysis.TokenStream,%20java.util.Set%29
> >
> > with enablePositionIncrements=false in mixed indexes in 3.0.0 if I want
> > to handle analysis consistently, or go for enablePositionIncrements=true
> > if I want to handle the new content "better" (bearing in mind that the
> > new content will eventually replace the old content anyhow).
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >

