As far as indexing is concerned, a simple way of tracking a clustered
system is to create autonomous indices that report to a central
repository. Create a table in the DB with a row per document (you
have a unique document ID, right?) and a column per server node;
each column acts as an indexing flag: P for pending, I for indexed,
and F for failed. A row is added whenever a new document is added to
the persistent store or tagged for indexing (trigger style). Each node
reads the persistent store on its own scheduler, collects the XML file
for indexing, and updates its server flag afterwards.
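A minimal sketch of that tracking table and the per-node pull loop,
using an in-memory SQLite DB. The table, column, and node names are
illustrative, not from any real setup:

```python
# Pull architecture: one row per document, one status column per node.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE index_status (
        doc_id TEXT PRIMARY KEY,    -- unique document ID
        node_a TEXT DEFAULT 'P',    -- one column per server node:
        node_b TEXT DEFAULT 'P'     -- P = pending, I = indexed, F = failed
    )
""")

# Trigger style: a row appears whenever a document enters the store.
conn.execute("INSERT INTO index_status (doc_id) VALUES ('doc-42')")

def index_pending(node_column):
    """Run on each node's own scheduler: find pending work, index, flag."""
    rows = conn.execute(
        f"SELECT doc_id FROM index_status WHERE {node_column} = 'P'"
    ).fetchall()
    for (doc_id,) in rows:
        try:
            # ... fetch the XML file for doc_id and feed the local index ...
            conn.execute(
                f"UPDATE index_status SET {node_column} = 'I' WHERE doc_id = ?",
                (doc_id,))
        except Exception:
            conn.execute(
                f"UPDATE index_status SET {node_column} = 'F' WHERE doc_id = ?",
                (doc_id,))

index_pending("node_a")
print(conn.execute("SELECT node_a, node_b FROM index_status").fetchone())
# node_a is now 'I'; node_b is still 'P' until that node's scheduler fires
```

Each node only ever touches its own column, which is what keeps the
indices autonomous.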
Alternatively (this one is much easier, but you will have issues with
atomicity), you can just rsync the XML files to a directory on each
clustered server, and the servers pick up the files and index them.
Quite simply, one is a pull architecture and the other is a push
architecture.
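The push side can be sketched like this: each node sweeps the directory
rsync delivers into, indexes each file, and moves it aside so it is not
indexed twice. The directory layout and the index_file() stub are
assumptions for illustration; the atomicity issue mentioned above is that
a sweep can see a file rsync is still writing:

```python
# Push architecture: pick up XML files dropped into a spool directory.
import os
import shutil
import tempfile

spool = tempfile.mkdtemp()              # stands in for the rsync target dir
done = os.path.join(spool, "done")
os.makedirs(done)

# Simulate two files delivered by rsync.
for name in ("a.xml", "b.xml"):
    with open(os.path.join(spool, name), "w") as f:
        f.write("<doc/>")

def index_file(path):
    pass  # feed the XML file to the local index here

for name in sorted(os.listdir(spool)):
    if not name.endswith(".xml"):
        continue                        # skip the done/ directory itself
    path = os.path.join(spool, name)
    index_file(path)
    shutil.move(path, os.path.join(done, name))  # don't index it twice

print(sorted(os.listdir(done)))
# ['a.xml', 'b.xml']
```

Having rsync write to a temp name and rename into place on completion is
one common way to narrow the partial-file window.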
Does that help?
Nader Henein
Stephane Bailliez wrote:
Nader Henein wrote:
Our setup is quite similar to yours, but in all honesty, you will
need to do some form of batching on your updates, simply because you
don't want to keep the IndexWriter open all the time.
For now, the index writer is closed after each added document. It does
not seem to have a major overhead compared to keeping it open; at most
the overhead is 2x in my tests, which is acceptable for now and on par
with the other commercial search engines they have been using. My
constraint is basically that the mergeFactor must be 1, but I honestly
think it will need to be relaxed when the document rate increases.
There has been no tuning yet.
I also have a quite specific document lifecycle. Incoming documents
are 5-10KB XML from which I'm only extracting 0.5-1KB of data to be
indexed. These documents NEVER change: they are never updated, and they
are only deleted for archiving purposes, because we keep only the
last 6 months of data.
As for clustering, we went through three iterations, each keeping x
indexes parallelized on x servers, all of this with failover and
independent index synchronization with your persistent store. There
was a little discussion about this a few weeks back, and I mentioned
that your biggest pain will be maintaining the integrity of parallel
indexes that are updated/deleted autonomously (atomic updates and
deletes), but there are ways of running iterative checks to make sure
that your indices stay clean.
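One way to run the iterative check mentioned above is to diff each
node's index against the persistent store by document ID: anything in
the store but not the index gets re-queued, and anything in the index
but not the store is a stray to delete. The IDs and helper name here
are purely illustrative:

```python
# Integrity check: compare document IDs in the store vs. one node's index.
def check_index(store_ids, index_ids):
    missing = store_ids - index_ids  # in the store, never indexed: re-queue
    stray = index_ids - store_ids    # gone from the store, still indexed
    return missing, stray

store = {"doc-1", "doc-2", "doc-3"}
index = {"doc-2", "doc-3", "doc-9"}
missing, stray = check_index(store, index)
print(sorted(missing), sorted(stray))
# ['doc-1'] ['doc-9']
```

Run per node on a schedule, this catches the atomicity failures that
autonomous updates and deletes leave behind.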
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---
--
Nader S. Henein
Senior Applications Architect
Bayt.com
---