I already have an implementation of a distributed crawler farm, where
crawler instances are runnign on different boxes. I want to come up with
a distributed indexing scheme using lucene and take advantage of the
distributed nature of my crawlers' distributed nature. Here is what I am
thinking.
Crawlers will analyze and tokenize the content for every URLs(aka
Documents) and create the following data for every url document:
<url-id, <field1, <term-f1-t1,term-f1-t2,term-f1-t3 etc.>> <field-2,
<term-f2-t1,term-f2-t2,term-f2-t3, >> ...... >
And then based on some partitioning function the carwlers can send a
subset of tokens(aka terms) to the indexing server. The partitioning
function can be as simple as based on the starting character of the
terms. Lets say if we have 5 indexers, we will distribute the indexing
data in the following manner :
Indexer1 - a-e
Indexer2 - f-j
Indexer3 - k-o
Indexer4 - p-t
Indexer5 - u-z
Does it make any sense ? Also would like to know if there are other ways
to distribute lucene's indexing/searching ?
thanks,
prasen
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]