Thanks a lot, guys. I really appreciate your responses to my query. I'll create multiple threads and check how much the rate can be increased per thread.

Regards,
Sandeep

On Tue, Feb 23, 2016 at 4:19 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> Your profiler breakdown is exactly what I'd expect: processing the
> fields is the heaviest part of indexing.
>
> Except, it doesn't have any merges? Did you run it for long enough?
> Note that by default Lucene runs merges in a background thread
> (ConcurrentMergeScheduler). If you really must be single thread'd
> (why?) then you should use SerialMergeScheduler instead.
>
> The doAfterDocument is likely the flush time (writing the new segment
> once the in-heap indexing buffer is full).
>
> Finally, if many of your fields are numeric, 6.0 offers some nice
> improvements here with the new dimensional points feature. See
> https://www.elastic.co/blog/lucene-points-6.0 ... but note 6.0 is not
> yet released though it should be soon now.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Feb 23, 2016 at 2:01 AM, sandeep das <yarnhad...@gmail.com> wrote:
> > Hi,
> >
> > I've implemented a tool using lucene-5.2.0 to index my CSV files. The tool
> > is reading data from CSV files (residing on disk) and creating indexes on
> > local disk. It is able to process 3.5 MBps data. There are overall 46 fields
> > being added in one document. They are only of three data types: 1. Integer,
> > 2. Long, 3. String.
> > All these fields are part of one CSV record and they are parsed using a custom
> > CSV parser which is faster than any split method of String.
> >
> > I've configured the following parameters to create the indexWriter:
> > 1. setOpenMode(OpenMode.CREATE)
> > 2. setCommitOnClose(true)
> > 3. setRAMBufferSizeMB(512) // Tried 256, 312 as well but performance is
> > almost the same.
> >
> > I've read on several blogs that Lucene works way faster than these
> > figures. So, I thought there are some bottlenecks in my code and profiled it
> > using jvisualvm. The application is spending most of the time in
> > DefaultIndexChain.processField, i.e. 53% of total time.
> >
> >
> > Following is the split of CPU usage in this application:
> > 1. reading data from disk is taking 5% of total duration
> > 2. adding document is taking 93% of total duration.
> >
> > postUpdate -> 12.8%
> > doAfterDocument -> 20.6%
> > updateDocument -> 59.8%
> >
> > finishDocument -> 1.7%
> > finishStoreFields -> 4.8%
> > processFields -> 53.1%
> >
> >
> > I'm also attaching the screenshot of the call graph generated by jvisualvm.
> >
> > I've taken care of the following points:
> > 1. create only one instance of indexWriter
> > 2. create only one instance of document and reuse it throughout the
> > lifetime of the application
> > 3. There will be no update in the documents, hence only addDocument is
> > invoked.
> > Note: After going through the code I found out that addDocument is
> > internally calling updateDocument only. Is there any way by which we can
> > avoid calling updateDocument and only use the addDocument API?
> > 4. Using setValue APIs to set the pre-created fields and reusing these
> > fields to create indexes.
> >
> > Any tip to improve the performance will be immensely appreciated.
> >
> > Regards,
> > Sandeep
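
For reference, a minimal sketch of the multi-threaded approach mentioned at the top of this message, assuming a single shared writer that is safe to call from several threads (Lucene's IndexWriter is documented as thread-safe). DocumentSink, indexAll, and the line-striping scheme are hypothetical names and choices made for illustration, not Lucene APIs, and String.split stands in for the custom CSV parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

final class ParallelIndexingSketch {

    /** Hypothetical stand-in for a call such as IndexWriter.addDocument; not a Lucene type. */
    interface DocumentSink {
        void add(String[] fields) throws Exception;
    }

    /**
     * Parses CSV lines on numThreads workers and feeds every record to one shared sink.
     * Returns the number of records handed to the sink.
     */
    static long indexAll(List<String> csvLines, int numThreads, DocumentSink sink)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        AtomicLong count = new AtomicLong();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int t = 0; t < numThreads; t++) {
                final int offset = t;
                futures.add(pool.submit(() -> {
                    // Simple striping: worker "offset" handles lines offset, offset + numThreads, ...
                    for (int i = offset; i < csvLines.size(); i += numThreads) {
                        // Assumption: plain split instead of the faster custom parser.
                        String[] fields = csvLines.get(i).split(",", -1);
                        sink.add(fields);
                        count.incrementAndGet();
                    }
                    return null;
                }));
            }
            // Wait for all workers; a failure in any worker surfaces here.
            for (Future<?> f : futures) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return count.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> lines = Arrays.asList("1,100,alpha", "2,200,beta", "3,300,gamma", "4,400,delta");
        // The sink here only counts fields; a real one would build and add a document.
        AtomicLong fieldsSeen = new AtomicLong();
        long records = indexAll(lines, 2, fields -> fieldsSeen.addAndGet(fields.length));
        System.out.println("records=" + records + " fields=" + fieldsSeen.get());
    }
}
```

In a real run the sink would wrap the indexWriter.addDocument call. Because Document and Field instances are mutable, each worker would presumably need its own reusable set rather than the single shared instance described in the quoted message.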