Hello Graylog Users,

We're seeing a strange issue with our Graylog deployment. Things generally 
seem to work fine, except that on average once a day our ElasticSearch 
cluster will go yellow or red. We have our nodes distributed across two 
datacenters and the issue seems to happen following a brief network 
partition, at which point shards are dropped from an index. When their node 
re-joins the cluster the shards never reach 'active' state and we see 
errors like:

Caused by: java.lang.IllegalStateException: try to recover [graylog_208][2] from primary shard with sync id but number of docs differ...

We can manually recover by dropping all replicas and then bringing them 
back up, or by manually rerouting a primary shard to a node, but if left 
alone the cluster never recovers on its own. Googling around suggests this 
is a known issue with network partitioning in older versions of 
ElasticSearch (e.g. here <https://github.com/elastic/elasticsearch/issues/12661> and here 
<https://github.com/elastic/elasticsearch/issues/7572>) and that it should 
be fixed (or at least improved) as of 5.0. That doesn't help us while we're 
using Graylog, and I can't find any mention of a backport.
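
For reference, the replica-drop workaround mentioned above can be sketched roughly as follows. This is only an illustration, not our exact procedure: the node address (localhost:9200), the default replica count of 1, and the helper names are assumptions, and it relies on the standard index settings endpoint (PUT /<index>/_settings with index.number_of_replicas).

```python
import json
import urllib.request

# Assumption: an Elasticsearch node reachable at this address.
ES_URL = "http://localhost:9200"


def build_replica_request(index, count, base_url=ES_URL):
    """Build a PUT request that sets number_of_replicas for one index."""
    body = {"index": {"number_of_replicas": count}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def reset_replicas(index, replicas=1, base_url=ES_URL):
    """Drop all replicas of an index, then restore them.

    Setting the count to 0 discards the stuck replica copies; raising it
    again makes the cluster rebuild them from the primaries.
    """
    for count in (0, replicas):
        with urllib.request.urlopen(build_replica_request(index, count, base_url)) as resp:
            resp.read()
```

Calling reset_replicas("graylog_208") would apply this to the index from the error above. The index has no redundancy until the replicas are rebuilt, so this is a stopgap rather than a fix.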

My specific question here is: Have others run into this issue and overcome 
it using Graylog/ElasticSearch 2.3?
The ability to distribute across datacenters is important for us, and I 
imagine it is widely used by others, so I have a hard time believing we're 
just that unlucky.

Any ideas/pointers/help would be much appreciated.

Thanks,
Kellen
