### Issue description

We have a data-consistency issue when storing data in Elasticsearch using Spark and the elasticsearch-spark connector. The job finishes successfully, but when we compare the original data (stored in S3) with the data stored in ES, some documents are missing from Elasticsearch.
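The comparison described above (source data vs. what actually landed in ES) can be sketched as a simple set difference over document IDs. This is a minimal, self-contained illustration; the function and variable names are hypothetical, not from the original post:

```python
# Sketch: given document IDs extracted from the S3 source and from
# Elasticsearch, report which documents never made it into the index.
# Names here are illustrative only.

def missing_documents(source_ids, es_ids):
    """Return IDs present in the source data but absent from Elasticsearch."""
    return sorted(set(source_ids) - set(es_ids))

# Example: two documents were never indexed.
source_ids = ["doc-1", "doc-2", "doc-3", "doc-4"]
es_ids = ["doc-1", "doc-3"]
print(missing_documents(source_ids, es_ids))  # ['doc-2', 'doc-4']
```

In practice the ID lists would come from reading the S3 data and querying ES, but the diff itself is what pinpoints the missing documents.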
### Steps to reproduce

This issue does not always happen and, unfortunately, we cannot reproduce it on demand. The only indicator we have found that correlates with occurrences of this bug is the presence of a failed stage while saving data to Elasticsearch. Jobs with this stage failure eventually complete successfully, but the data is inconsistent.

We use the following configuration:

- Elasticsearch:
  - `"es.write.operation": "index"`
  - `"es.nodes.discovery": "false"`
  - `"es.nodes.wan.only": "true"`
- Spark:
  - write mode: `"append"`

### Version Info

- OS: Amazon Linux
- JVM: 1.8
- Hadoop/Spark: Hadoop 2.7.3 (Amazon), Spark 2.2.0
- ES-Hadoop: elasticsearch-spark-20_2.11:5.5.2
- ES: 5.3 (Amazon Elasticsearch Service)

### Questions

I'm looking for guidance on how to debug this issue:

1. Why doesn't Elasticsearch contain all the data, even though Spark reports that the job finished and saved the data?
2. What can we do to ensure that we write data to ES in a consistent manner?
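For reference, the write configuration listed above can be expressed as a PySpark write, roughly like this. This is a sketch only: it assumes the elasticsearch-spark-20 connector is on the classpath, and the S3 path, ES endpoint, and index name are placeholders, not from the original post:

```python
# Sketch of the Spark -> Elasticsearch write described above.
# Placeholders (not from the original post): the S3 path, the ES endpoint,
# and the index/type resource name.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-es").getOrCreate()

df = spark.read.parquet("s3://bucket/path/")  # hypothetical source data

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.write.operation", "index")
   .option("es.nodes.discovery", "false")
   .option("es.nodes.wan.only", "true")
   .option("es.nodes", "https://example-domain.es.amazonaws.com")  # placeholder
   .mode("append")
   .save("myindex/mytype"))  # placeholder index/type
```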