RE: Profiling lucene 5.2.0 based tool

Uwe Schindler Tue, 23 Feb 2016 01:41:34 -0800

Hi,

There is nothing you can improve in the single-threaded case; the only way 
to get more throughput is to parallelize. Lucene is optimized for parallel 
processing while indexing, so you should make use of that.
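
A minimal sketch of that approach, using only the JDK. The class name 
ParallelIndexSketch, the method indexAll and the Consumer-based "sink" are 
hypothetical stand-ins, not Lucene API. With Lucene, the sink would wrap one 
shared IndexWriter (IndexWriter is thread-safe) and call addDocument on it, 
and each worker would reuse its own Document and Field instances instead of 
sharing one Document across threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

public class ParallelIndexSketch {

    /**
     * Splits the records into one contiguous slice per worker thread and
     * passes every record to the shared sink. The sink is an assumption of
     * this sketch: it stands in for "build a document and hand it to the
     * single shared writer" and must be safe to call from several threads.
     * Returns the number of records handed to the sink.
     */
    public static long indexAll(List<String> records, int threads, Consumer<String> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong indexed = new AtomicLong();
        int chunk = (records.size() + threads - 1) / threads; // ceiling division
        for (int t = 0; t < threads; t++) {
            int from = Math.min(records.size(), t * chunk);
            int to = Math.min(records.size(), from + chunk);
            List<String> slice = records.subList(from, to);
            pool.submit(() -> {
                for (String record : slice) {
                    sink.accept(record);
                    indexed.incrementAndGet();
                }
            });
        }
        pool.shutdown(); // no new tasks; already submitted slices still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            records.add("row-" + i + ",42,some text");
        }
        AtomicLong chars = new AtomicLong();
        // The lambda only counts characters; a real sink would index the record.
        long n = indexAll(records, 4, r -> chars.addAndGet(r.length()));
        System.out.println(n + " records, " + chars.get() + " chars");
    }
}
```

The thread count of 4 is arbitrary; a sensible value depends on the number 
of cores and on how fast the CSV input can be read.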


> > > Note: After going through the code I found out that addDocument is
> > > internally calling updateDocument only. Is there any way by which we can
> > > avoid calling updateDocument and only use addDocument API?

Updating a document means deleting the old one and indexing a new one, so 
both operations share the same internal logic and it is perfectly fine to 
delegate internally. The only difference is that addDocument does not delete 
a previous document: it passes null as the delete term, which makes 
updateDocument skip the delete step. So there is nothing to improve.
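
The delegation can be illustrated with a toy model. This is not Lucene 
code: ToyWriter and its in-memory list are hypothetical, and the delete key 
is simplified to a field/value pair of strings. It only shows why "add" can 
be expressed as "update with nothing to delete" at no extra cost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToyWriter {
    // Each "document" is a map from field name to value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    /** Adds a document; a null delete key means nothing is removed first. */
    public void addDocument(Map<String, String> doc) {
        updateDocument(null, null, doc);
    }

    /**
     * If a delete key is given, removes every stored document whose field
     * has the given value, then appends the new document. With a null key
     * the delete step is skipped entirely.
     */
    public void updateDocument(String field, String value, Map<String, String> doc) {
        if (field != null) {
            docs.removeIf(d -> value.equals(d.get(field)));
        }
        docs.add(doc);
    }

    public int size() {
        return docs.size();
    }
}
```

Adding the same document twice through addDocument leaves two copies in 
the model, while updateDocument with a matching key replaces the earlier 
copies with the new document.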

Uwe
 
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: sandeep das [mailto:yarnhad...@gmail.com]
> Sent: Tuesday, February 23, 2016 8:30 AM
> To: java-user@lucene.apache.org
> Subject: Re: Profiling lucene 5.2.0 based tool
> 
> Hi Rob,
> 
> The statistics which I had shared were provided using one thread for
> indexing. I wish to use only 1 thread and want to process maximum
> 10MBps(Mega Bytes per second) of data rate. I believe with single thread it
> should be achievable.
> 
> Regards,
> Sandeep
> 
> On Tue, Feb 23, 2016 at 12:50 PM, Rob Audenaerde
> <rob.audenae...@gmail.com>
> wrote:
> 
> > Hi Sandeep,
> >
> > How many threads do you use to do the indexing? The benchmarks of Lucene
> > are done on >20 threads IIRC.
> >
> > -Rob
> >
> > On Tue, Feb 23, 2016 at 8:01 AM, sandeep das <yarnhad...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > > I've implemented a tool using lucene-5.2.0 to index my CSV files. The tool
> > > > is reading data from CSV files(residing on disk) and creating indexes on
> > > local disk. It is able to process 3.5 MBps data. There are overall 46
> > > fields being added in one document. They are only of three data types 1.
> > > Integer, 2. Long, 3. String.
> > > All these fields are part of one CSV record and they are parsed using
> > > custom CSV parser which is faster than any split method of string.
> > >
> > > I've configured the following parameters to create indexWriter
> > > 1. setOpenMode(OpenMode.CREATE)
> > > 2. setCommitOnClose(true)
> > > 3. setRAMBufferSizeMB(512)   // Tried 256, 312 as well but performance is
> > > almost same.
> > >
> > > I've read over several blogs that lucene works way faster than these
> > > figures. So, I thought there are some bottlenecks in my code and profiled
> > > it using jvisualvm. The application is spending most of the time in
> > > DefaultIndexChain.processField i.e. 53% of total time.
> > >
> > >
> > > Following is the split of CPU usage in this application:
> > > 1. reading data from disk is taking 5% of total duration
> > > 2. adding document is taking 93% of total duration.
> > >
> > >    -    postUpdate  -> 12.8%
> > >    -    doAfterDocument -> 20.6%
> > >    -    updateDocument  -> 59.8%
> > >       - finishDocument -> 1.7%
> > >       - finishStoreFields -> 4.8%
> > >       - processFields -> 53.1%
> > >
> > >
> > > I'm also attaching the screen shot of call graph generated by jvisualvm.
> > >
> > > I've taken care of following points:
> > > 1. create only one instance of indexWriter
> > > 2. create only one instance of document and reuse it through out the life
> > > time of application
> > > 3. There will be no update in the documents hence only addDocument is
> > > invoked.
> > > Note: After going through the code I found out that addDocument is
> > > internally calling updateDocument only. Is there any way by which we can
> > > avoid calling updateDocument and only use addDocument API?
> > > 4. Using setValue APIs to set the pre created fields and reusing these
> > > fields to create indexes.
> > >
> > > Any tip to improve the performance will be immensely appreciated.
> > >
> > > Regards,
> > > Sandeep
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> >


