Distributed Lucene..

Prasenjit Mukherjee Sun, 05 Mar 2006 22:02:08 -0800

I already have an implementation of a distributed crawler farm, where crawler instances are running on different boxes. I want to come up with a distributed indexing scheme using Lucene that takes advantage of the distributed nature of my crawlers. Here is what I am thinking.

Crawlers will analyze and tokenize the content of every URL (aka document) and create the following data for every URL document:

<url-id, <field-1, <term-f1-t1, term-f1-t2, term-f1-t3, etc.>>, <field-2, <term-f2-t1, term-f2-t2, term-f2-t3, ...>>, ...>
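
To make the per-document data concrete, here is a minimal Java sketch of that structure. The class name CrawledDocument, the addField method, and the lowercase split on non-word characters are all assumptions for illustration; a real crawler would presumably run a proper Lucene analyzer instead of this toy tokenizer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical container for <url-id, <field, <terms...>>, ...>.
public class CrawledDocument {
    final String urlId;
    // Field name -> terms in order of appearance; insertion order is kept.
    final Map<String, List<String>> termsByField = new LinkedHashMap<>();

    CrawledDocument(String urlId) {
        this.urlId = urlId;
    }

    // Toy tokenizer (an assumption, not Lucene's analysis chain):
    // lowercase the text and split on runs of non-word characters.
    void addField(String field, String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        termsByField.put(field, terms);
    }

    public static void main(String[] args) {
        CrawledDocument doc = new CrawledDocument("url-42");
        doc.addField("title", "Distributed Lucene");
        doc.addField("body", "Crawlers tokenize the content");
        System.out.println(doc.urlId + " " + doc.termsByField);
    }
}
```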

Then, based on some partitioning function, the crawlers can send a subset of tokens (aka terms) to the corresponding indexing server. The partitioning function can be as simple as the starting character of the term. Let's say we have 5 indexers; we will distribute the indexing data in the following manner:


Indexer1 - a-e
Indexer2 - f-j
Indexer3 - k-o
Indexer4 - p-t
Indexer5 - u-z
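
The mapping above could be sketched in Java roughly as follows. The names TermPartitioner, indexerFor and partition are hypothetical, and sending terms that do not start with a-z (digits, for example) to Indexer1 is my own assumption, since the table only covers letters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical first-letter partitioner for the a-e / f-j / ... scheme.
public class TermPartitioner {

    // Returns the 1-based indexer number for a term.
    static int indexerFor(String term, int numIndexers) {
        if (numIndexers < 1 || numIndexers > 26) {
            throw new IllegalArgumentException("numIndexers must be in 1..26");
        }
        if (term.isEmpty()) {
            return 1; // assumption: empty terms go to Indexer1
        }
        char c = Character.toLowerCase(term.charAt(0));
        if (c < 'a' || c > 'z') {
            return 1; // assumption: non-letter terms go to Indexer1
        }
        // Equal-width letter ranges; leftover letters fall to the last
        // indexer, which is why Indexer5 covers u-z (6 letters) for 5 indexers.
        int width = 26 / numIndexers;
        return Math.min((c - 'a') / width, numIndexers - 1) + 1;
    }

    // Splits a list of terms into one bucket per indexer (bucket 0 = Indexer1).
    static List<List<String>> partition(List<String> terms, int numIndexers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numIndexers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String term : terms) {
            buckets.get(indexerFor(term, numIndexers) - 1).add(term);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apache", "lucene", "index", "term", "zebra");
        System.out.println(partition(terms, 5));
    }
}
```

One consequence of splitting by term rather than by document is that a single document's terms end up spread across several indexers, so a multi-term query would have to consult more than one of them.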

Does it make any sense? I would also like to know if there are other ways to distribute Lucene's indexing/searching.


thanks,
prasen
