answering my own question: nutch.org -> lucene.apache.org/nutch/
Excellent!

Paul

On 15/07/2005, at 11:45 AM, Paul Smith wrote:
Cooool, I should go have a look at that. That begs another question though: where does Nutch stand in terms of the ASF? Did I read (or dream) that Nutch may be coming in under the ASF? I guess I should get myself subscribed to the Nutch mailing lists. Thanks Erik.

Paul

On 15/07/2005, at 11:36 AM, Erik Hatcher wrote:

Paul - it sounds an awful lot like my (perhaps incorrect) understanding of the MapReduce capability of Nutch that is under development. Perhaps the work that Doug and others have done there is applicable to your situation.

Erik

On Jul 14, 2005, at 7:38 PM, Paul Smith wrote:

My punt was that having workers create sub-indexes (creating the documents and making a partial index) and shipping the partial index back to the queen to merge may be more efficient. It's probably not; I was just using the day as a chance to see if it looked promising, and to get my hands dirty with SEDA (and ActiveMQ, a great JMS provider).

The bit I forgot about until too late was that the merge is going to have to be 'serialized' by the queen; there's no real way to get around that that I could think of. Worse, because the .merge(Directory[]) method does an automatic optimize, I had to get the queen to temporarily store the partial indexes locally until it had all the results before doing a merge, otherwise the queen would be spending _forever_ doing merges.

If you had the workers just creating Documents, I'm not sure that would get over the overhead of the SEDA network/JMS transport. The document doesn't really get inverted until it's sent to the writer? In this example the queen would be doing 99% of the work, wouldn't it? (defeating the purpose of SEDA).

What I was aiming at was an easily growable, self-healing indexing network. Need more capacity? Throw in a Mac Mini or something, and let it join as a worker via Zeroconf. Worker dies? Work just gets done somewhere else. A true Google-like model. But I don't think I quite got there.

Re: document in multiple indexes.
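The queen-side buffering Paul describes (hold all partial results, then merge once) can be sketched in plain Java. Class and method names here are hypothetical, and the actual Lucene merge call is left as a pluggable callback; in the Lucene of that era it would presumably wrap `IndexWriter.addIndexes(Directory[])`, which is what triggers the automatic optimize he mentions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Queen-side buffer: hold partial-index results until the whole batch
// has arrived, then hand them all to the merge step in one go. This
// pays the automatic optimize once, not once per incoming result.
class MergeBuffer {
    private final int expected;                              // number of work units sent out
    private final List<String> partials = new ArrayList<>(); // paths to locally stored partial indexes
    private final Consumer<List<String>> mergeAll;           // hypothetical hook around the Lucene merge

    MergeBuffer(int expected, Consumer<List<String>> mergeAll) {
        this.expected = expected;
        this.mergeAll = mergeAll;
    }

    /** Called as each worker's result arrives; merges once all are in. */
    synchronized void onResult(String partialIndexPath) {
        partials.add(partialIndexPath);
        if (partials.size() == expected) {
            mergeAll.accept(partials); // the single, serialized merge on the queen
        }
    }
}
```

The merge still serializes on the queen, as the thread notes; the buffering only avoids repeating the optimize per result.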
The queen was the co-ordinator, so its first task was to work out what entities to index (everything in our app can be referenced via a URI, mail://2358 or doc://9375). The queen works out all the URIs to index, batches them up into a work unit and broadcasts that as a work message on the work queue. The worker takes that package, creates a relatively small index (5000-10000 documents, whatever the chunk size is), and ships the directory as a zipped-up binary message back onto the result queue. Therefore there were never any duplicates: a worker either processed the work unit, or it didn't.

Paul

On 15/07/2005, at 9:28 AM, Otis Gospodnetic wrote:

Interesting. I'm planning on doing something similar for some new Simpy features. Why are your worker bees sending whole indices to the queen bee? Wouldn't it be easier to send in Documents and have the queen index them in the same index? Maybe you need those individual, smaller indices to be separate...

How do you deal with the possibility of the same Document being present in multiple indices?

Otis

--- Paul Smith <[EMAIL PROTECTED]> wrote:

I had a crack at whipping up something along these lines during a 1-day hackathon we held here at work, using ActiveMQ as the bus between the 'co-ordinator' (queen bee) and the 'worker' bees. The index work was segmented as jobs on a work queue, and the workers fed the relatively small index chunks back along a result queue, which the co-ordinator then merged in.

The tough part, from observing the outcome, is knowing what the chunk size should be, because in the end the co-ordinator needs to merge all the sub-indexes together into one, and for a large index that's not an insignificant time.
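The batching step the queen performs can be sketched in plain Java (class names are hypothetical; the real system would put each work unit on the JMS work queue rather than return a list). The chunk size is the knob the thread identifies as the hard part to tune:

```java
import java.util.ArrayList;
import java.util.List;

// A work unit: one batch of entity URIs for a worker to index.
class WorkUnit {
    final List<String> uris;
    WorkUnit(List<String> uris) { this.uris = uris; }
}

// Queen-side batching: split the full URI list into fixed-size chunks.
// chunkSize is the tuning knob (5000-10000 documents in Paul's runs).
class Batcher {
    static List<WorkUnit> batch(List<String> allUris, int chunkSize) {
        List<WorkUnit> units = new ArrayList<>();
        for (int i = 0; i < allUris.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, allUris.size());
            units.add(new WorkUnit(allUris.subList(i, end)));
        }
        return units;
    }
}
```

Because each URI lands in exactly one work unit, and a worker either processes its unit or doesn't, no document can appear in two partial indexes - which is the answer Paul gives to Otis's duplicates question.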
You also have to do some bookkeeping to work out if a 'job' has not been completed in time (maybe failure by the worker) and decide whether the job should be resubmitted (in theory JMS with transactions would help there, but then you have a throughput problem on that too).

Would love to see something like this work really well, and perhaps generalize it a bit more. I do like the simplicity of the SEDA principles.

cheers,

Paul Smith

On 14/07/2005, at 11:50 PM, Peter Gelderbloem wrote:

I am currently looking into building a similar system and came across this architecture: http://www.eecs.harvard.edu/~mdw/proj/seda/

I am just reading up on it now. Does anyone have experience building a Lucene system based on this architecture? Any advice would be greatly appreciated.

Peter Gelderbloem
Registered in England 3186704

-----Original Message-----
From: Luke Francl [mailto:[EMAIL PROTECTED]
Sent: 13 May 2005 22:04
To: java-user@lucene.apache.org
Subject: Re: Best Practices for Distributing Lucene Indexing and Searching

On Tue, 2005-03-01 at 19:23, Chris Hostetter wrote:

I don't really consider reading/writing to an NFS-mounted FSDirectory to be viable for the very reasons you listed; but I haven't really found any evidence of problems if you take the approach that a single "writer" node indexes to local disk, which is NFS-mounted by all of your other nodes for doing queries. Concurrent updates/queries may still not be safe (I'm not sure), but you could have the writer node "clone" the entire index into a new directory, apply the updates and then signal the other nodes to stop using the old FSDirectory and start using the new one.

Thanks to everyone who contributed advice to my question about how to distribute a Lucene index across a cluster. I'm about to start on the implementation and I wanted to clarify something about using NFS that Chris wrote about above.
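The bookkeeping Paul mentions can be sketched as a deadline map on the queen (all names hypothetical; a real system would drive this from a timer thread and resubmit overdue jobs back onto the JMS work queue):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Queen-side job tracker: record a deadline when each job is dispatched,
// and report which jobs have timed out so they can be resubmitted.
class JobTracker {
    private final Map<String, Long> deadlines = new HashMap<>(); // jobId -> deadline (millis)
    private final long timeoutMillis;

    JobTracker(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    synchronized void dispatched(String jobId, long nowMillis) {
        deadlines.put(jobId, nowMillis + timeoutMillis);
    }

    synchronized void completed(String jobId) {
        deadlines.remove(jobId);
    }

    /** Jobs past their deadline; caller resubmits them, and the clock resets. */
    synchronized List<String> overdue(long nowMillis) {
        List<String> late = new ArrayList<>();
        for (Map.Entry<String, Long> e : deadlines.entrySet()) {
            if (nowMillis > e.getValue()) late.add(e.getKey());
        }
        for (String id : late) deadlines.put(id, nowMillis + timeoutMillis);
        return late;
    }
}
```

This trades the throughput cost of JMS transactions for at-least-once semantics managed by the queen; a work unit might then be processed twice after a slow worker finally responds, so results would need to be de-duplicated by job id.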
There are many warnings about indexing on an NFS file system, but is it safe to have a single node index, while the other nodes use the file system in read-only mode?

On a related note, our software is cross-platform and needs to work on Windows as well. Are there any known problems having a read-only index shared over SMB?

Using a shared file system is preferable to me because it's easier, but if it's necessary I will write the code to copy the index to each node.

Thanks,

Luke Francl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]