RE: distributing the indexing process

Guru Chandar Thu, 30 Jun 2011 02:44:14 -0700

Thanks for the response. The documents are all distinct. My (limited)
understanding is that partitioning the index will give results that differ
from the single-partition case, because Lucene currently does not support
distributed idf. Is this correct? Is there a way to make it work
seamlessly?
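
To make the concern concrete, here is a rough sketch in plain Java (no
Lucene calls). It assumes the classic idf formula from Lucene's
DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)). The class and method
names are made up for illustration. The idea is that each partition sees
only its own docFreq and numDocs, so the same term gets a different idf
on each partition unless the statistics are summed first.

```java
final class IdfSketch {
    // Assumed formula: the classic DefaultSimilarity idf,
    // 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // Hypothetical helper: sum the per-partition statistics and compute
    // a single idf, as a single combined index would.
    static double globalIdf(long[] docFreqs, long[] numDocs) {
        if (docFreqs.length != numDocs.length) {
            throw new IllegalArgumentException("one docFreq per partition expected");
        }
        long totalDocFreq = 0;
        long totalDocs = 0;
        for (int i = 0; i < docFreqs.length; i++) {
            totalDocFreq += docFreqs[i];
            totalDocs += numDocs[i];
        }
        return idf(totalDocFreq, totalDocs);
    }
}
```

With made-up numbers, a term found in 9 of 1000 documents on one partition
and in 99 of 1000 on another gets a different idf on each partition, and
globalIdf(new long[] {9, 99}, new long[] {1000, 1000}) falls between the
two. Scores from separate partitions are therefore not directly comparable
unless something shares the summed statistics.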

Regards,
-gc

-----Original Message-----
From: Danil ŢORIN [mailto:torin...@gmail.com] 
Sent: Thursday, June 30, 2011 3:04 PM
To: java-user@lucene.apache.org
Subject: Re: distributing the indexing process

It depends....

If all documents are distinct then, yeah, go for it.

If you have multiple versions of the same document in your data and you
only want to index the latest version, then you need a clever way to
split the data so that all versions of a document are indexed on the
same host and you don't end up with duplicates later.
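
One possible way to do that split, as a minimal sketch: route each document
by a hash of its id, so every version of a document lands on the same host,
and keep only the highest version per id on that host. This assumes each
document has a stable string id and a numeric version; PartitionRouter, Doc,
partitionFor and keepLatest are hypothetical names, not Lucene API.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

final class PartitionRouter {
    // Hypothetical document stub: a stable id plus a version number.
    static final class Doc {
        final String id;
        final long version;
        Doc(String id, long version) {
            this.id = id;
            this.version = version;
        }
    }

    // The same id always maps to the same host index in [0, numHosts).
    static int partitionFor(String docId, int numHosts) {
        if (numHosts <= 0) {
            throw new IllegalArgumentException("numHosts must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is too.
        return (int) (crc.getValue() % numHosts);
    }

    // On a single host: keep only the highest version seen for each id.
    static Map<String, Doc> keepLatest(List<Doc> docs) {
        Map<String, Doc> latest = new HashMap<>();
        for (Doc d : docs) {
            Doc seen = latest.get(d.id);
            if (seen == null || d.version > seen.version) {
                latest.put(d.id, d);
            }
        }
        return latest;
    }
}
```

Because the routing depends only on the id, two versions of "doc-42" go to
the same host, where keepLatest drops the older one before indexing. Note
that changing the number of hosts changes the mapping, so the host count
has to stay fixed for the whole indexing run.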

But my biggest concern is: if your index is so big that you need to
build it on different hosts, are you sure you want it combined into
a single index?
Maybe it's a good idea to keep it partitioned?

On Thu, Jun 30, 2011 at 12:12, Guru Chandar <guru.chan...@consona.com> wrote:
>
>
> If we have to index a lot of documents, is there a way to divide the
> documents into multiple sets and index them on multiple machines in
> parallel, and then merge the resulting indexes back into a single
> machine? If yes, will the result be logically equivalent to indexing all
> the documents on a single machine?
>
> Thanks,
> -gc

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
