Mike: It's Sunday morning and I felt like typing for a while, but the long rant can be summarized as follows...
1> Be sure you know where you're spending your time before you implement a complex solution to the problem. I've been surprised many times that the actual problem was nowhere near where I thought it was. Added complexity for speed sometimes (maybe even often) creates more problems than it solves. You can find out with some simple timings and/or a profiler. Simple timings should take no more than a couple of hours to generate and will allow you to proceed with confidence.

2> Keep whatever solution you implement as simple as possible. Every step of a complex solution introduces costs: maintenance costs, time costs, dissatisfied-customer costs. Especially be sure that you don't get caught up in making your solution as good as it can be. The most wonderful, elegant solutions imaginable usually get thrown out pretty soon anyway, since the problem changes. Good enough to last until it's thrown out is the most efficient in terms of time spent.... Of course, your professor is only spending your time <G>....

3> Define the goal. Really ask yourself what is "good enough" and shoot for that. In this case, the first question I'd ask is "why are 1,000,000 documents necessary?" If you can convince your professor that 100,000 documents is enough for demonstration purposes, your problem is already solved and you can generate your indexes in 3 hours.

If you're interested in the background, see below...

Best
Erick

**********long rant starts here************

But before jumping to the multi-threaded solution, I'd *strongly* advise that you verify what's taking all the time. I can't tell you how many times over the last 25 years I've been wrong about where the inefficiencies were. I once (in C++) cut the execution time by 75% by changing something like the following....

    while (condition) {
        string str = anotherstring + fragment;
        anotherstring = str;
    }

to

    anotherstring.reserve(10000);
    while (condition) {
        anotherstring += fragment;
    }

or something like that (look, it's been 10 years, OK <G>). The point is that I never would have found that line of code without a profiler, and all the other optimizations I could have made wouldn't have improved performance by even a third of that amount. At best, I would have changed the code unintentionally and removed the aforementioned ugly code by accident.

Yes, it's stupid code. And yes, I wrote the original code in a moment of inattention. You will do things equally stupid, trust me <G>..... But are you sure such silliness isn't happening in your program? If you're sure, how? Look at some code you wrote 6 months ago and tell me it's all good <G>.....

So, unless you are absolutely sure where the time is being spent, do a few experiments. You can do the first pass in, maybe, 2 hours... You can start with just a few timers in your code. For instance, total the time you spend reading the files, the time spent parsing the XML, and the time you spend actually adding the documents to the index. Just use simple timers like System.currentTimeMillis(). For instance, let's say your code is organized as follows, and you change the for loop to index just 100 documents, and the run takes 100 seconds:

    for each file
        read the file from disk
        parse the xml
        add the fields to a Lucene document
        index the document

Start with something crude and just time each major step. If your intuition is correct and all the time is being spent in the index step, you're done. If it's in the last two steps, you're also done and need to start thinking about multiple threads, perhaps.
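To make "something crude" concrete, here's a minimal sketch of what I mean, not your code. It assumes the Lucene 2.x-era IndexWriter constructor, a StandardAnalyzer, a made-up index path of "/tmp/test-index", and a buildLuceneDoc() that merely stands in for however your real program maps XML fields to a Lucene Document; swap in whatever your program actually does for reading and field mapping.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.xml.sax.InputSource;

    public class IndexTimer {
        public static void main(String[] args) throws Exception {
            File[] files = new File(args[0]).listFiles();
            IndexWriter writer = new IndexWriter("/tmp/test-index", new StandardAnalyzer(), true);
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();

            long readMs = 0, parseMs = 0, buildMs = 0, indexMs = 0;
            int maxDocs = 100;   // just enough documents to see where the time goes

            for (int i = 0; i < files.length && i < maxDocs; i++) {
                long t0 = System.currentTimeMillis();
                String xml = readFile(files[i]);                  // read the file from disk
                long t1 = System.currentTimeMillis();
                org.w3c.dom.Document dom =
                        parser.parse(new InputSource(new StringReader(xml)));  // parse the xml
                long t2 = System.currentTimeMillis();
                Document doc = buildLuceneDoc(dom);               // add the fields to a Lucene document
                long t3 = System.currentTimeMillis();
                writer.addDocument(doc);                          // index the document
                long t4 = System.currentTimeMillis();

                readMs += t1 - t0;
                parseMs += t2 - t1;
                buildMs += t3 - t2;
                indexMs += t4 - t3;
            }
            writer.close();
            System.out.println("read=" + readMs + "ms  parse=" + parseMs
                    + "ms  build=" + buildMs + "ms  index=" + indexMs + "ms");
        }

        private static String readFile(File f) throws IOException {
            StringBuilder sb = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), "UTF-8"));
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                sb.append(buf, 0, n);
            }
            in.close();
            return sb.toString();
        }

        // Placeholder: pull out whatever fields your real code pulls out of the DOM.
        private static Document buildLuceneDoc(org.w3c.dom.Document dom) {
            Document doc = new Document();
            doc.add(new Field("body", dom.getDocumentElement().getTextContent(),
                    Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }
    }

Run something like that over a hundred or a thousand documents and the four totals will tell you which step is actually worth attacking.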
But what if it's in the XML parsing step? You might be waaay better off using a more efficient parser (and different implementations vary WIDELY in how fast they are).

The seductive thing about, say, introducing multi-threading to a solution is that it'll work. By definition. Throwing 10 threads at this problem will index things in 3 hours and you can say "See, I solved the problem." It's possible that no, you did not. What you really did was make the problem waaaaay more complex (and create endless hassles for yourself in the future) without really solving the root issue.

Look, it's Sunday morning and I'm avoiding some tasks I need to do, so pardon the long post. But I'm on a roll <G>. If you've followed this list, you'll see that this is a recurring theme of mine. The reason is that I've seen way too many situations where the total cost of a complex solution is never thought through by the programmers. Let's say you create a 10-thread solution. Let's further say that you run each thread on a different machine. You've now added lots of extra steps to generating your index. And I'll absolutely guarantee that you (or someone) will spend time debugging the process of merging all the indexes. Not to mention the times one process doesn't complete before you try to merge the indexes. Or the time you forgot to start the third machine, combined the resulting indexes anyway, and had anomalous results that you had to spend 2 days figuring out. Or the time..... And it's a most uncomfortable thing to have to say "I don't know what was wrong, but recreating the index fixed it"....

And, finally (thank God, he says), be clear on what "good enough" performance is. You say it takes 30 hours to create the index. Is 12 hours acceptable? If you could get a single-threaded operation down to 12 hours, the simplicity may be enough to make up for the fact that you can't generate an index before tomorrow. You can always generate test indexes of just a few documents for debugging purposes. There's always a tendency to say "but if I could get it down to 4 hours, I could generate an index twice a day". But this is probably irrelevant. To debug programs and test results, you must be able to generate an index in 10 minutes at the outside. Anything longer is unacceptable, since you'll spend all your time waiting. 15 seconds is even better. That implies that you need a way to limit the number of documents you index for development purposes and then, when you're satisfied, kick off the rebuild of the entire index.

My point is that in my experience, it's unlikely that you'll be building more than one complete index a day anyway. And kicking off the build the evening before and using it the next morning may be good enough. And a lot less prone to error if you can keep the index generation simple.

Anyway, it's off to do some work around the house. Oh, wait. I can read the Sunday paper first.....

Best
Erick

On 2/17/07, Mike O'Leary <[EMAIL PROTECTED]> wrote:
I am taking a class in which the professor has assigned a project to take a question answering application that was submitted by a team of students to one of the TREC contests last year and turn it into a teaching tool. One thing he wants to have done is add the capability for students to create a variety of indexes with different settings in order to observe the ways in which selecting a different index can cause the results to vary.

The application searches over a specified set of just over a million XML-formatted documents that doesn't change, so there are no requirements at this point for adding and deleting documents. Because the team that created the application last year only needed to index it once (after they figured out what parameters they wanted), they didn't need to care very much that it took around 30 hours to index the documents one by one using a single threaded indexing program. Now we want to be able to index that same set of documents in much less time.

I am new to Lucene, so I am just going by what I have found so far in the Lucene in Action book and on the internet. The section in the book on indexing concurrency says that you can share an IndexWriter object among several threads and that the calls from these threads will be properly synchronized. Will this in itself improve indexing performance very much? It seems like the synchronization that is needed for keeping the index from being corrupted would limit how much you gain from using several threads.

In any case, my overall question is, given an indexing task of this kind, where you don't have to worry about additions, deletions and updates of the documents being indexed, just indexing the whole document database as a batch each time a user wants to index it in a different way, what would be the fastest way to do it using the various Lucene indexing tools and features? Thanks.

Mike O'Leary
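For reference, if the timings really do show that the indexing step itself dominates, the shared-IndexWriter approach the book describes can be sketched roughly like this. This is only a sketch under assumptions: Lucene 2.x-era constructors, a made-up index path, an arbitrary thread count of 4, and a buildLuceneDoc() placeholder for the real XML parsing and field mapping.

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ThreadedIndexer {
        public static void main(String[] args) throws Exception {
            final IndexWriter writer = new IndexWriter("/tmp/test-index", new StandardAnalyzer(), true);
            ExecutorService pool = Executors.newFixedThreadPool(4);   // thread count: something to tune

            for (final File f : new File(args[0]).listFiles()) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // The per-thread work: parse the XML and build the Document.
                            Document doc = buildLuceneDoc(f);
                            // addDocument() on the shared writer is synchronized internally,
                            // which is what Lucene in Action is referring to.
                            writer.addDocument(doc);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
            writer.optimize();   // optional: merge segments once at the end
            writer.close();
        }

        // Placeholder: parse the XML file and map its fields, as in the single-threaded version.
        private static Document buildLuceneDoc(File f) throws Exception {
            Document doc = new Document();
            doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        }
    }

Whether this actually helps depends on where the time is going, which is the whole point of the advice above: if parsing dominates, the threads mostly help by parsing in parallel; if the bottleneck is elsewhere, they help very little. The heavier alternative of building one index per thread or per machine and combining them with IndexWriter.addIndexes() is exactly the merge bookkeeping warned about earlier.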