Re: Index Replication / Clustering

Nader Henein Sun, 26 Jun 2005 04:54:03 -0700

As far as indexing is concerned, a simple way of tracking a clustered system is to create autonomous indices that report to a central repository. Create a table in the DB with a row per document (you have a unique document ID, right?) and a column per server node; the columns act as an indexing flag: P for pending, I for indexed and F for failed. A row is added whenever a new document is added to the persistent store or tagged for indexing (trigger style). Each node reads the persistent store using its own scheduler, then collects the XML file for indexing and modifies its server flag afterwards.

Alternatively, and this one is much easier but you will have issues with atomicity, you can just rsync the XML files to a directory on each clustered server, and the servers can then pick up the files and index them. Quite simply, one is a pull architecture and the other is a push architecture.
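To make the pull variant a bit more concrete, here is a rough sketch of the tracking table and one scheduler pass for a single node. It is only an illustration under my own assumptions: it uses Python with an SQLite database standing in for the DB, and the table name (index_status), the node names and the index_fn callback are all hypothetical, not something from an actual setup.

```python
import sqlite3


def create_tracking_table(conn, nodes):
    """Create one row per document and one flag column per server node.

    Node names are used as column names, so they are assumed to come from
    trusted configuration and are checked to be plain identifiers.
    """
    for node in nodes:
        if not node.isidentifier():
            raise ValueError(f"unsafe node name: {node!r}")
    # Every node starts at 'P' (pending) for a newly registered document.
    columns = ", ".join(f"{node} TEXT NOT NULL DEFAULT 'P'" for node in nodes)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS index_status "
        f"(doc_id TEXT PRIMARY KEY, {columns})"
    )
    conn.commit()


def register_document(conn, doc_id):
    """Add a row when a document lands in the persistent store."""
    conn.execute(
        "INSERT OR IGNORE INTO index_status (doc_id) VALUES (?)", (doc_id,)
    )
    conn.commit()


def pending_for(conn, node):
    """Return the document IDs this node still has flagged as pending."""
    if not node.isidentifier():
        raise ValueError(f"unsafe node name: {node!r}")
    rows = conn.execute(
        f"SELECT doc_id FROM index_status WHERE {node} = 'P' ORDER BY doc_id"
    ).fetchall()
    return [row[0] for row in rows]


def run_node_pass(conn, node, index_fn):
    """One scheduler pass: index each pending document, then set the flag.

    index_fn is a hypothetical callback that fetches the XML for doc_id and
    adds it to the local index; any exception marks the document as failed.
    """
    for doc_id in pending_for(conn, node):
        try:
            index_fn(doc_id)
            flag = "I"  # indexed
        except Exception:
            flag = "F"  # failed, left for a later retry or inspection
        conn.execute(
            f"UPDATE index_status SET {node} = ? WHERE doc_id = ?",
            (flag, doc_id),
        )
    conn.commit()
```

Each node would call run_node_pass from its own scheduler with its own column name, so a slow or failed node never blocks the others, and a query for rows still at P or F shows which nodes are behind.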


Does that help?


Nader Henein

Stephane Bailliez wrote:

Nader Henein wrote:
Our setup is quite similar to yours, but in all honesty, you will need to do some form of batching on your updates, simply because you don't want to keep the IndexWriter open all the time.
For now, the index writer is closed after each added document. It does not seem to have such a major overhead compared to keeping it open; at most the overhead is 2x in my tests, which is acceptable for now and on par with other commercial search engines they have been using. My constraint is basically that the mergeFactor must be 1, but honestly I think it will need to be relaxed when the document rate increases.
There has been no tuning yet.
I also have a quite specific document lifecycle. Incoming documents are 5-10 KB XML, from which I'm only extracting 0.5-1 KB of data to be indexed. These documents NEVER change. They are not updated, nor deleted.
They are only deleted for archiving purposes, because we keep only the last 6 months of data.
As for clustering, we went through three iterations that keep x indexes parallelized on x servers, all of this with failover and index-independent synchronization with your persistent store. There was a little discussion about this a few weeks back, and I mentioned that your biggest pain will be maintaining the integrity of parallel indexes that are updated/deleted autonomously (atomic updates and deletes), but there are ways of running iterative checks to make sure that your indices stay clean.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

--

Nader S. Henein
Senior Applications Architect

Bayt.com
