On 2 Mar 08, at 03:05, 仇寅 wrote:

Hi,

I agree with your point that it is easier to partition the index by document. But the partition-by-keyword approach scales much better than the partition-by-document approach: each query involves communicating with a constant number of nodes, while partition-by-document requires spreading the query across all or many of the nodes. I am actually doing some small research on this. By the way, the documents to be indexed are not necessarily web pages; they are mostly files stored on each node's file system.

Node failures are also handled by replicas: the index for each term will be replicated on multiple nodes whose node IDs are close to each other. This mechanism is handled by the underlying DHT system.
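Concretely, the term-to-node routing could look like this (a minimal sketch assuming a Chord-style DHT where keys and node IDs share one hash space; NODE_COUNT, REPLICAS, and the 16-bit demo key are made-up simplifications, and a real DHT does this lookup itself):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TermRouter {
    static final int NODE_COUNT = 64;  // assumed size of the node ring
    static final int REPLICAS = 3;     // index copies on adjacent node IDs

    // Hash a term into the DHT's ID space (SHA-1, as in Chord).
    static int keyFor(String term) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1")
                    .digest(term.getBytes(StandardCharsets.UTF_8));
            // Take two bytes as a small demo ID; a real DHT keeps all 160 bits.
            return ((d[0] & 0xFF) << 8 | (d[1] & 0xFF)) % NODE_COUNT;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }

    // The node owning a term's index, plus replicas on nearby node IDs.
    static int[] nodesFor(String term) {
        int[] nodes = new int[REPLICAS];
        int home = keyFor(term);
        for (int i = 0; i < REPLICAS; i++) {
            nodes[i] = (home + i) % NODE_COUNT;  // successor-list replication
        }
        return nodes;
    }
}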

So, any idea how I can partition an index by keyword in Lucene? Thanks.

When you read a file and tokenize it, you dispatch each token to a different index, tagged with a unique document ID.
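For example (a sketch assuming a recent Lucene TokenStream API; sendToNode is a hypothetical placeholder for whatever RPC or DHT put your system uses):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDispatcher {
    public static void dispatch(String docId, String text) throws Exception {
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("contents", new StringReader(text))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // Route each posting (term -> docId) to the node
                // responsible for that term.
                sendToNode(term.toString(), docId);
            }
            ts.end();
        }
    }

    // Hypothetical transport; replace with your DHT's put/RPC call.
    static void sendToNode(String term, String docId) {
        System.out.println(term + " -> " + docId);
    }
}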

Can you explain more things about the context of your application?

I don't know why you need P2P. Is it for file sharing? If so, the index should stay near the document.
Is it for distributed computing? Then use central data and Hadoop Map/Reduce.

If you want a cluster of Lucene nodes for heavy querying, use the rsync + mv trick of Technorati. If you persist with term dispatching, use it only for caching: each node provides a term index of its own documents. When you search something, the parsed query gives you every Term (I can give you code for that); you first ask which nodes contain each Term, and then you send the query to those nodes.
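Here is that term-extraction code, as a sketch rather than a finished implementation: it parses with the classic QueryParser and only walks TermQuery and BooleanQuery, so phrase, wildcard, and other query types would need their own cases. It assumes a recent Lucene.

import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryTerms {
    public static Set<Term> terms(String userQuery) throws Exception {
        Query q = new QueryParser("contents", new StandardAnalyzer()).parse(userQuery);
        Set<Term> out = new HashSet<>();
        collect(q, out);
        return out;
    }

    private static void collect(Query q, Set<Term> out) {
        if (q instanceof TermQuery) {
            out.add(((TermQuery) q).getTerm());
        } else if (q instanceof BooleanQuery) {
            for (BooleanClause c : ((BooleanQuery) q).clauses()) {
                collect(c.getQuery(), out);
            }
        }
        // Other Query subclasses are ignored in this sketch.
    }
}

With the Set of Terms in hand, you look up each term's node (e.g. with the routing above) and fan the query out only to those nodes.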

M.