Thanks Otis. I hadn't considered that approach; however, not all of our fields are stored, so that's not going to work for me.

I'm wondering if it's slow because there is just the one reader getting passed to the index writer. I noticed today that the addIndexes method can take an array of readers. Maybe if I can send in an array of readers for the individual segments in the index it will go faster; I'll try that tomorrow, with something like the sketch below.
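To make sure I'm describing this clearly, here's an untested sketch of what I have in mind against the Lucene 4.1 API. addPerSegment is just a name I made up, and I've left out wrapping each segment reader with the fake-delete logic; the idea is only to show addIndexes() receiving one reader per segment instead of a single composite reader:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class SegmentReaderSplit {
    /**
     * Untested: open the source index, grab one reader per segment, and hand
     * the whole array to addIndexes() instead of a single composite reader.
     * (Wrapping each segment reader with the fake-delete logic is omitted.)
     */
    static void addPerSegment(Directory source, IndexWriter writer) throws IOException {
        DirectoryReader composite = DirectoryReader.open(source);
        try {
            List<IndexReader> segments = new ArrayList<IndexReader>();
            for (AtomicReaderContext ctx : composite.leaves()) {
                segments.add(ctx.reader()); // one AtomicReader per segment
            }
            writer.addIndexes(segments.toArray(new IndexReader[segments.size()]));
        } finally {
            composite.close();
        }
    }
}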
Jed

Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

Jed,

While this is something completely different, have you considered using SolrEntityProcessor instead (assuming all your fields are stored)?
http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Thu, Mar 21, 2013 at 2:25 PM, Jed Glazner <jglaz...@adobe.com> wrote:

Hey Everybody!

I'm not sure if I should have posted this to the developers list; if I'm totally barking up the wrong tree here, please let me know!

Anyway, I've developed a command line utility based on the MultiPassIndexSplitter class from the Lucene library, but I'm finding that on our large index (350GB) it takes way too long to write the newly split indexes: 20.5 hours for execution to finish. I should note that Solr is not running while I'm splitting the index. Because Solr can't be running while I run this tool, performance is critical, as our service will be down the whole time.

I am aware that there is an API currently under development on trunk in SolrCloud (https://issues.apache.org/jira/browse/SOLR-3755), but I need something now, as our large index is wreaking havoc on our service.

Here is some basic context info:

The Index:
==============
Solr/Lucene 4.1
Index size: 350GB
Documents: 185,194,528

The Hardware (http://aws.amazon.com/ec2/instance-types/):
===============
AWS High-Memory X-Large (m2.xlarge) instance
CPU: 8 cores (2 virtual cores with 3.25 EC2 Compute Units each)
17.1 GB RAM
1.2TB EBS RAID

The Process (splitting 1 index into 8):
===============
I'm trying to split this index into 8 separate indexes using this tool. To do this I create 8 worker threads. Each thread gets a new FakeDeleteIndexReader object, loops over every document, and uses a hash algorithm to decide whether to keep or delete each document. Note that the documents are not actually deleted at this point because (as I understand it) the FakeDeleteIndexReader emulates deletes without actually modifying the underlying index.

After each worker has determined which documents it should keep, I create a new Directory object, instantiate a new IndexWriter, and pass the FakeDeleteIndexReader object to the addIndexes method. (This is the part that takes forever!)

It only takes about an hour for all of the threads to hash/delete the documents they don't want. However, it takes 19+ hours to write all of the new indexes! Watching iowait, the disk doesn't look to be overworked (about 85% idle), so I'm baffled as to why it would take that long. I've tried running the write operations inside the worker threads, and serially, with no real difference. Each worker's marking pass looks roughly like the sketch below.
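Simplified sketch of one worker; in the real code the hash is computed from our unique key field rather than the doc id, and deleteDocument() is a method my FakeDeleteIndexReader adaptation exposes:

// Simplified: the real code hashes our unique key field; modulo over the
// doc id is just for illustration. deleteDocument() only marks the doc as
// deleted in memory; nothing touches the underlying index on disk.
private void markPartition(FakeDeleteIndexReader reader, int partition, int numPartitions)
        throws IOException {
    for (int docId = 0; docId < reader.maxDoc(); docId++) {
        if ((docId % numPartitions) != partition) {
            reader.deleteDocument(docId); // emulated delete only
        }
    }
}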
Here is the relevant code that I'm using to write the indexes:

/**
 * Creates/merges a new index from a FakeDeleteIndexReader. The reader should have
 * marked/deleted all of the documents that should not be included in the new index.
 * When the index is written/committed, those documents will be removed.
 *
 * @param directory the Directory object of the new index
 * @param version   the Lucene version of the index
 * @param reader    a FakeDeleteIndexReader that contains lots of uncommitted deletes
 * @throws IOException
 */
private void writeToDisk(Directory directory, Version version, FakeDeleteIndexReader reader)
        throws IOException {
    IndexWriterConfig cfg = new IndexWriterConfig(version, new WhitespaceAnalyzer(version));
    cfg.setOpenMode(OpenMode.CREATE);

    IndexWriter w = new IndexWriter(directory, cfg);
    w.addIndexes(reader); // this single call accounts for nearly all of the runtime
    w.commit();
    w.close();
    reader.close();
}

Any ideas? I'm happy to share more snippets of source code if that is helpful.

--
Jed Glazner
Sr. Software Engineer
Adobe Social
385.221.1072 (tel)
801.360.0181 (cell)
jglaz...@adobe.com
550 East Timpanogos Circle
Orem, UT 84097-6215, USA
www.adobe.com