Re: Performance tips when creating a large index from database.

2009-10-27 Thread Toke Eskildsen
On Thu, 2009-10-22 at 15:14 +0200, Erick Erickson wrote: > Besides the other suggestions, I'd really, really, really put > some instrumentation in the code and see where you're spending your time. For > a fast hint, put > a cumulative timer around your indexing part only. This will indicate > whethe

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Chris Lu
All previous suggestions are very good. It's usually just the database. Lucene itself is fast enough. Years ago, when I used a Pentium III, indexing speed mattered. But after upgrading the CPU to a Xeon etc., the indexing bottleneck is on the database side. Basically use the simplest SQL as

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
This is basically what LuSql does. The speedups ("8h to 30 min") are similar, usually about an order of magnitude. Oh, the comments suggesting most of the interaction is with the database? The answer is: it depends. With large Lucene documents: Lucene is the limiting factor (worsen

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Thomas Becker
Profile your application first hand and find out where the bottlenecks really are during indexing. For me it was clearly the database calls which took most of the time, due to a very complex SQL query. I applied the producer-consumer pattern and put a blocking queue in between. I have a threadpo
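
A minimal sketch of the producer-consumer arrangement described above, using only the JDK. Everything named here is an assumption made for illustration: the `ProducerConsumerSketch` class, the `run` method, the queue capacity of 1000, and the fake `"row-N"` strings that stand in for database rows. The `incrementAndGet()` call marks the spot where a real consumer would hand a document to Lucene. Which side is pooled in the setup above is not stated; this sketch pools the consumers.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ProducerConsumerSketch {
    // Sentinel compared by identity; tells one consumer that no more rows will arrive.
    private static final String POISON = new String("__END__");

    /** Pushes rowCount fake rows through a bounded queue; returns how many were consumed. */
    static int run(int rowCount, int consumers) {
        // Bounded queue: the producer blocks when consumers fall behind (capacity is arbitrary).
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        AtomicInteger indexed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String row = queue.take();
                        if (row == POISON) {
                            break;
                        }
                        // Placeholder: a real consumer would build and add a Lucene document here.
                        indexed.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            // Producer side: stands in for iterating over a database result set.
            for (int id = 0; id < rowCount; id++) {
                queue.put("row-" + id);
            }
            // One sentinel per consumer so that every worker terminates.
            for (int i = 0; i < consumers; i++) {
                queue.put(POISON);
            }
            pool.shutdown();
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("consumers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return indexed.get();
    }
}
```

Calling `ProducerConsumerSketch.run(5000, 4)` pushes 5000 fake rows through four consumer threads. The bounded queue is the design point: it keeps the database reader from racing arbitrarily far ahead of the indexing threads and exhausting memory.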

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Marcelo Ochoa
Hi Paul: Most of the time spent indexing big tables goes to the full table scan and network data transfer. Please take a quick look at my OOW08 presentation about Oracle Lucene integration: http://docs.google.com/present/view?id=ddgw7sjp_156gf9hczxv especially slides 13 and 14 wh

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Paul Taylor
Glen Newton wrote: You might want to consider using LuSql, which is a high performance, multithreaded, well documented tool designed specifically for moving data from a JDBC database into Lucene (you didn't say if it was a JDBC-accessible db...) http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswik

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Erick Erickson
Besides the other suggestions, I'd really, really, really put some instrumentation in the code and see where you're spending your time. For a fast hint, put a cumulative timer around your indexing part only. This will indicate whether the time is consumed in querying your database or indexing..
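
One way the cumulative timer suggested above might look, as a rough sketch using only `System.nanoTime()`. The `CumulativeTimer` class and its method names are hypothetical, not part of Lucene or any library; the idea is to wrap only the indexing call in `time(...)` inside the main loop and compare the accumulated total against the overall wall-clock time.

```java
final class CumulativeTimer {
    private long totalNanos; // time accumulated across all timed sections
    private int calls;       // number of timed sections so far

    /** Runs the task and adds its elapsed time to the running total. */
    void time(Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            // Accumulate even if the task throws, so partial runs are still counted.
            totalNanos += System.nanoTime() - start;
            calls++;
        }
    }

    long totalNanos() {
        return totalNanos;
    }

    long totalMillis() {
        return totalNanos / 1_000_000L;
    }

    int calls() {
        return calls;
    }
}
```

In a loop this would be used roughly as `indexTimer.time(() -> indexBatch(rows))`, where `indexBatch` is a placeholder for whatever code adds documents to the index. If `indexTimer.totalMillis()` is a small fraction of the total run time, the remainder is being spent elsewhere, most likely in the database queries.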

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Ian Lea
See also http://wiki.apache.org/lucene-java/ImproveIndexingSpeed. That includes some info on merge and buffer factors, and recommends multiple threads. When I've done this sort of thing in the past it has tended to be the database that is the problem, but maybe your database is faster than mine.

Re: Performance tips when creating a large index from database.

2009-10-22 Thread Glen Newton
You might want to consider using LuSql, which is a high performance, multithreaded, well documented tool designed specifically for moving data from a JDBC database into Lucene (you didn't say if it was a JDBC-accessible db...) http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql Di

Performance tips when creating a large index from database.

2009-10-22 Thread Paul Taylor
I'm building a Lucene index from a database, creating about 1 million documents; unsurprisingly, this takes quite a long time. I do this by sending a query to the db over a range of ids (10,000 records), adding these results to Lucene, then getting the next range, and so on. When completed indexing I the
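
The id-range batching described above can be sketched as follows. This is only an illustration under assumptions: the `IdRanges` class, the `split` method, and the idea that ids run contiguously from a known minimum to a known maximum are not from the original post. Each returned pair would be bound to a range query of the poster's own design.

```java
import java.util.ArrayList;
import java.util.List;

final class IdRanges {
    /** Splits the inclusive range [minId, maxId] into inclusive sub-ranges of at most batchSize ids. */
    static List<long[]> split(long minId, long maxId, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long lo = minId; lo <= maxId; lo += batchSize) {
            // Clamp the upper bound so the final range does not run past maxId.
            long hi = Math.min(lo + batchSize - 1, maxId);
            ranges.add(new long[] {lo, hi});
        }
        return ranges;
    }
}
```

With the figures from the post, `IdRanges.split(1, 1_000_000, 10_000)` yields 100 ranges, each of which corresponds to one database round trip followed by one batch of additions to the index.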