On Thu, 2009-10-22 at 15:14 +0200, Erick Erickson wrote:
> Besides the other suggestions, I'd really, really, really put
> some instrumentation in the code and see where you're spending your time. For
> a fast hint, put
> a cumulative timer around your indexing part only. This will indicate
> whethe
All previous suggestions are very good.
It's usually just the database. Lucene itself is fast enough.
Years ago, when I used a Pentium III, indexing speed mattered.
But after upgrading the CPU to a Xeon etc., the indexing bottleneck is on the
database side.
Basically use the simplest SQL as
This is basically what LuSql does. The time improvements ("8h to 30 min")
are similar, usually about an order of magnitude.
Oh, the comments suggesting most of the interaction is with the
database? The answer is: it depends.
With large Lucene documents: Lucene is the limiting factor (worsen
Profile your application first hand and find out where the bottlenecks really
are during indexing.
For me it was clearly the database calls that took most of the time, due to a
very complex SQL query.
I applied the Producer-Consumer pattern and put a blocking queue in between. I
have a threadpo
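A minimal sketch of the producer-consumer arrangement described above, under stated assumptions: the class name QueueIndexer, the queue capacity of 100, and the end-of-input sentinel are all hypothetical, the row list stands in for the database result set, and the `indexed` queue stands in for the Lucene writer. It is not the poster's code, only one way the pattern could look.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class QueueIndexer {
    // Sentinel compared by identity; tells a consumer that no more rows will arrive.
    private static final String POISON = new String("<end>");

    /** Feeds rows through a bounded queue to a pool of consumer threads. */
    static Queue<String> run(List<String> rows, int consumers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100); // assumed capacity
        Queue<String> indexed = new ConcurrentLinkedQueue<>();       // stand-in for the index
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> {
                try {
                    // Each consumer blocks on take() until a row or the sentinel arrives.
                    for (String row = queue.take(); row != POISON; row = queue.take()) {
                        indexed.add(row); // a real consumer would build and add a document here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        try {
            // Producer side: put() blocks when the queue is full, which throttles the reader.
            for (String row : rows) {
                queue.put(row);
            }
            // One sentinel per consumer so every thread terminates.
            for (int i = 0; i < consumers; i++) {
                queue.put(POISON);
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return indexed;
    }
}
```

The bounded queue is the design choice that matters here: if the database side is faster than the indexing side, the producer simply waits instead of filling memory.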
Hi Paul:
Most of the time spent indexing big tables goes to the full table
scan and network data transfer.
Please take a quick look at my OOW08 presentation about Oracle
Lucene integration:
http://docs.google.com/present/view?id=ddgw7sjp_156gf9hczxv
especially slides 13 and 14 wh
Glen Newton wrote:
You might want to consider using LuSql, which is a high performance,
multithreaded, well documented tool designed specifically for moving
data from a JDBC database into Lucene (you didn't say if it was a
JDBC-accessible db...)
http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswik
Besides the other suggestions, I'd really, really, really put
some instrumentation in the code and see where you're spending your time. For
a fast hint, put
a cumulative timer around your indexing part only. This will indicate
whether
the time is consumed in querying your database or indexing.
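A cumulative timer of the kind suggested above might look like the following sketch. The names PhaseTimer and TimedIndexingDemo are hypothetical, and the two lambdas are stand-ins for the real JDBC query and the real Lucene indexing call; only the timing logic is the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Accumulates elapsed wall-clock time across many calls to the same phase. */
final class PhaseTimer {
    private long totalNanos;
    private long calls;

    /** Runs the task, adds its duration to the running total, and returns its result. */
    <T> T time(Supplier<T> task) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            totalNanos += System.nanoTime() - start;
            calls++;
        }
    }

    long totalMillis() { return totalNanos / 1_000_000; }

    long calls() { return calls; }
}

final class TimedIndexingDemo {
    public static void main(String[] args) {
        PhaseTimer dbTimer = new PhaseTimer();
        PhaseTimer indexTimer = new PhaseTimer();
        List<String> index = new ArrayList<>(); // stand-in for the index writer

        for (int batch = 0; batch < 3; batch++) {
            final int b = batch;
            // Hypothetical stand-in for the database query.
            List<String> rows = dbTimer.time(() -> List.of("row-" + b + "-a", "row-" + b + "-b"));
            // Hypothetical stand-in for adding documents to the index.
            indexTimer.time(() -> index.addAll(rows));
        }
        System.out.printf("db: %d ms, index: %d ms%n",
                dbTimer.totalMillis(), indexTimer.totalMillis());
    }
}
```

Comparing the two totals after a full run shows which side dominates, which is cheaper than setting up a profiler for a first look.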
See also http://wiki.apache.org/lucene-java/ImproveIndexingSpeed.
That includes some info on merge and buffer factors, and recommends
multiple threads. When I've done this sort of thing in the past it
has tended to be the database that is the problem, but maybe your
database is faster than mine.
You might want to consider using LuSql, which is a high performance,
multithreaded, well documented tool designed specifically for moving
data from a JDBC database into Lucene (you didn't say if it was a
JDBC-accessible db...)
http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
Di
I'm building a Lucene index from a database, creating about 1 million
documents; unsurprisingly, this takes quite a long time.
I do this by sending a query to the db over a range of ids (10,000
records).
Add these results to Lucene.
Then get the next 10,000, and so on.
When completed indexing I the
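The id-range batching described in this message can be sketched as below. The class name IdRanges and the inclusive-range convention are assumptions for illustration; each range would be bound to a query of the general form `WHERE id BETWEEN ? AND ?`, which is itself an assumption about the poster's schema.

```java
import java.util.ArrayList;
import java.util.List;

final class IdRanges {
    /** Splits [minId, maxId] into consecutive inclusive ranges of at most batchSize ids. */
    static List<long[]> split(long minId, long maxId, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += batchSize) {
            // The last range is clipped so it never runs past maxId.
            long end = Math.min(start + batchSize - 1, maxId);
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }
}
```

For example, `IdRanges.split(1, 25000, 10000)` yields the ranges 1 to 10000, 10001 to 20000, and 20001 to 25000.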