Hi all,

Has anyone implemented larger Dovecot+Solr clusters and would be willing to 
give some details about how it works for you? My understanding about it so far 
is:

 - SolrCloud isn’t usable with Dovecot. Replication isn’t useful, because 
nobody wants to pay for double the disk space for indexes that could be 
regenerated anyway. The autosharding isn’t very useful either, because I think 
the shard keys could be created in two possible ways: a) Mails would be 
entirely randomly distributed across the cluster. This would make updates 
efficient, because the writes would be fully distributed across all servers, 
but I think it would also make reads somewhat inefficient, since all the 
servers would have to be searched and the results combined. Also, if a server 
is lost, there’s no easy way to reindex the missing data, because the server 
would contain a piece of pretty much every user’s data. b) Shard keys could be 
created so that the same user would typically go to only 1-2 servers (a sketch 
of this follows below). It would be possible (at least in theory) to find a 
broken server’s list of users and reindex only their data, but I’m not sure 
this method is any easier than the non-SolrCloud setup.
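
For what it’s worth, here’s a minimal sketch of what option b) could look like 
with SolrCloud’s compositeId router, which hashes only the part of the 
document id before the "!" to pick a shard. The collection URL, id scheme and 
field names here are my own assumptions, loosely modeled on the fields 
Dovecot’s fts-solr plugin stores:

    import requests

    SOLR = "http://localhost:8983/solr/dovecot"  # hypothetical collection URL

    def index_mail(user, mailbox, uid, body):
        # The "user!" prefix makes SolrCloud hash only the username when
        # choosing a shard, so all of one user's mail stays together
        # (option b). Dropping the prefix and using a random id would
        # give the fully scattered distribution of option a).
        doc = {
            "id": "%s!%s/%d" % (user, mailbox, uid),
            "user": user,
            "box": mailbox,
            "uid": uid,
            "body": body,
        }
        requests.post(SOLR + "/update", json=[doc]).raise_for_status()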

 - Without SolrCloud you’d then need to shard the data manually. This would be 
easy enough to do by just assigning different users to different shards. But at 
some point the index will become too large and you’ll need to add more shards 
and move some existing users to them. To keep search performance good during 
the move, I guess this could be done with a script (sketched below) that does: 
1) reindex the user to the new shard, 2) update userdb to point to the new 
shard, 3) delete the user from the old shard, 4) doveadm fts rescan the user to 
remove any mails that were expunged during the reindexing.
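
To make the idea concrete, a rough sketch of that 4-step move in Python. The 
shard URL and the userdb update are placeholders, since both depend on the 
deployment; the delete in step 3 assumes the Solr schema stores the username 
in a "user" field, as Dovecot’s schema does:

    import subprocess
    import requests

    OLD_SHARD = "http://solr-old:8983/solr/dovecot"  # hypothetical URL

    def update_userdb(user):
        # Placeholder: point the user's shard reference at the new shard
        # in whatever userdb backend (SQL, LDAP, ...) is in use.
        raise NotImplementedError

    def move_user(user):
        # 1) Reindex the user's mails into the new shard (assumes the
        #    indexing side is already directed at the new shard for this
        #    user, which again is deployment-specific).
        subprocess.run(["doveadm", "index", "-u", user, "*"], check=True)
        # 2) Make searches go to the new shard.
        update_userdb(user)
        # 3) Remove the user's documents from the old shard in one
        #    delete-by-query; quoting the username avoids query syntax
        #    problems with characters like "@".
        requests.post(OLD_SHARD + "/update", params={"commit": "true"},
                      json={"delete": {"query": 'user:"%s"' % user}}
                      ).raise_for_status()
        # 4) Drop index entries for mails expunged while step 1 ran.
        subprocess.run(["doveadm", "fts", "rescan", "-u", user], check=True)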

 - It seems that a Solr index shouldn’t grow above 200 GB, or performance 
starts getting too bad? I’ve seen this mentioned on a few web pages. So each 
server should likely be running multiple separate Solr instances (shards).

 - 
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-from-500000-volumes-5-million-volumes-and-beyond
recommends NFS (or I guess any kind of shared filesystem), which does seem 
to make sense. Since Dovecot wants to get instantly updated search results 
after indexing, I think it’s probably better not to separate the indexing and 
searching servers (see the sketch below).
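
Related to the "instantly updated" requirement, Solr’s commitWithin parameter 
(or an autoSoftCommit setting in solrconfig.xml) is probably the relevant 
knob: it makes new documents searchable within a bounded delay without a hard 
commit per message. A hedged sketch, with the URL and document again being my 
own assumptions:

    import requests

    SOLR = "http://localhost:8983/solr/dovecot"  # hypothetical URL

    def add_doc(doc):
        # Ask Solr to make the document visible to searches within one
        # second, instead of forcing an expensive commit per message.
        requests.post(SOLR + "/update", params={"commitWithin": "1000"},
                      json=[doc]).raise_for_status()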

 - Would be interesting to know what kind of hardware your Solr servers 
currently have, how well they’re performing and what the bottlenecks are. From 
the above URL it appears that disk I/O comes first, and once there’s enough of 
that available, CPU usage is next. I’m not quite sure where most of the memory 
goes - caching?

 - I’m guessing users are doing relatively few searches compared to how many 
new emails are being indexed/deleted all the time?
