Hello Graylog Users,

We're seeing a strange issue with our Graylog deployment. Things generally work fine, except that about once a day our ElasticSearch cluster goes yellow or red. Our nodes are distributed across two datacenters, and the issue seems to follow a brief network partition, at which point shards are dropped from an index. When their node rejoins the cluster, the shards never reach the 'active' state and we see errors like:
Caused by: java.lang.IllegalStateException: try to recover [graylog_208][2] from primary shard with sync id but number of docs differ...

We can recover manually by dropping all replicas and then bringing them back up, or by manually rerouting a primary shard to a node, but if left alone it never recovers on its own.

Googling around indicates this is a known issue with network partitioning in older versions of ElasticSearch (e.g. here <https://github.com/elastic/elasticsearch/issues/12661> and here <https://github.com/elastic/elasticsearch/issues/7572>) and that it should be fixed (or at least improved) as of 5.0. That doesn't help us when using Graylog, though, and I can't find any mention of a backport.

My specific question is: have others run into this issue and overcome it using Graylog/ElasticSearch 2.3? The ability to distribute across datacenters is important for us, and I imagine it is widely used by other people, so I have a hard time believing we're just that unlucky.

Any ideas/pointers/help would be much appreciated.

Thanks,
Kellen
