Hi all,

Has anyone implemented larger Dovecot+Solr clusters and would be willing to 
give some details about how it works for you? My understanding about it so far 
is:

 - SolrCloud isn’t usable with Dovecot. Replication isn’t useful, because 
nobody wants to pay for double the disk space for indexes that could be 
regenerated anyway. The autosharding isn’t very useful either, because I think 
the shard keys could be created in two possible ways: a) Mails would be 
entirely randomly distributed across the cluster. This would make updates 
efficient, because the writes would be fully distributed across all servers, 
but I think it would also make reads somewhat inefficient, since all the 
servers would have to be searched and the results combined. Also, if a server 
is lost, there’s no easy way to reindex the missing data, because the server 
would contain a piece of pretty much every user’s data. b) Shard keys could be 
created so that the same user would typically go to only 1-2 servers (a sketch 
of this follows below). It would be possible (at least in theory) to find a 
broken server’s list of users and reindex only their data, but I’m not sure 
this method is any easier than the non-SolrCloud setup.
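
For what it’s worth, here’s a minimal sketch of what option b) could look like 
with SolrCloud’s compositeId router, which hashes only the part of the 
document id before the "!" to pick a shard. The collection URL, id scheme and 
field names here are my own assumptions, loosely modeled on the fields 
Dovecot’s fts-solr plugin stores:

    import requests

    SOLR = "http://localhost:8983/solr/dovecot"  # hypothetical collection URL

    def index_mail(user, mailbox, uid, body):
        # The "user!" prefix makes SolrCloud hash only the username when
        # choosing a shard, so all of one user's mail stays together
        # (option b). Dropping the prefix and using a random id would
        # give the fully scattered distribution of option a).
        doc = {
            "id": "%s!%s/%d" % (user, mailbox, uid),
            "user": user,
            "box": mailbox,
            "uid": uid,
            "body": body,
        }
        requests.post(SOLR + "/update", json=[doc]).raise_for_status()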

 - Without SolrCloud you’d then need to shard the data manually. This would be 
easy enough to do by just assigning different users to different shards. But at 
some point the index will become too large and you’ll need to add more shards 
and move some existing users to them. To keep search performance good during 
the move, I guess this could be done with a script (sketched below) that does: 
1) reindex the user to the new shard, 2) update userdb to point to the new 
shard, 3) delete the user from the old shard, 4) doveadm fts rescan the user to 
remove any mails that were expunged during the reindexing.
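
To make the idea concrete, a rough sketch of that 4-step move in Python. The 
shard URL and the userdb update are placeholders, since both depend on the 
deployment; the delete in step 3 assumes the Solr schema stores the username 
in a "user" field, as Dovecot’s schema does:

    import subprocess
    import requests

    OLD_SHARD = "http://solr-old:8983/solr/dovecot"  # hypothetical URL

    def update_userdb(user):
        # Placeholder: point the user's shard reference at the new shard
        # in whatever userdb backend (SQL, LDAP, ...) is in use.
        raise NotImplementedError

    def move_user(user):
        # 1) Reindex the user's mails into the new shard (assumes the
        #    indexing side is already directed at the new shard for this
        #    user, which again is deployment-specific).
        subprocess.run(["doveadm", "index", "-u", user, "*"], check=True)
        # 2) Make searches go to the new shard.
        update_userdb(user)
        # 3) Remove the user's documents from the old shard in one
        #    delete-by-query; quoting the username avoids query syntax
        #    problems with characters like "@".
        requests.post(OLD_SHARD + "/update", params={"commit": "true"},
                      json={"delete": {"query": 'user:"%s"' % user}}
                      ).raise_for_status()
        # 4) Drop index entries for mails expunged while step 1 ran.
        subprocess.run(["doveadm", "fts", "rescan", "-u", user], check=True)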

 - It seems that a Solr index shouldn’t grow above 200 GB, or performance 
starts getting too bad? I’ve seen this mentioned on a few web pages. So each 
server should likely be running multiple separate Solr instances (shards).

 - 
http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-from-500000-volumes-5-million-volumes-and-beyond
recommends NFS (or I guess any kind of shared filesystem), which does seem 
to make sense. Since Dovecot wants to get instantly updated search results 
after indexing, I think it’s probably better not to separate the indexing and 
searching servers (see the sketch below).
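
Related to the "instantly updated" requirement, Solr’s commitWithin parameter 
(or an autoSoftCommit setting in solrconfig.xml) is probably the relevant 
knob: it makes new documents searchable within a bounded delay without a hard 
commit per message. A hedged sketch, with the URL and document again being my 
own assumptions:

    import requests

    SOLR = "http://localhost:8983/solr/dovecot"  # hypothetical URL

    def add_doc(doc):
        # Ask Solr to make the document visible to searches within one
        # second, instead of forcing an expensive commit per message.
        requests.post(SOLR + "/update", params={"commitWithin": "1000"},
                      json=[doc]).raise_for_status()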

 - Would be interesting to know what kind of hardware your Solr servers 
currently have, how well they’re performing and what the bottlenecks are. From 
the above URL it appears that disk I/O comes first, and once there’s enough of 
that available, CPU usage is next. I’m not quite sure where most of the memory 
goes - caching?

 - I’m guessing users are doing relatively few searches compared to how many 
new emails are being indexed/deleted all the time?
