Putting a queue like RabbitMQ between Spark and ES could probably help, by buffering the Spark output whenever ES can't keep up.
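
If you want to see roughly what that would look like on the Spark side, here is a minimal sketch, assuming the pika client library and a RabbitMQ broker on localhost (both are my assumptions, not something from this thread):

    # Minimal sketch: publish each partition's records to a RabbitMQ queue
    # instead of writing to Elasticsearch directly. A separate consumer
    # would then drain the queue into ES at whatever rate ES can sustain.
    # Assumes the `pika` client and a broker on localhost -- placeholders,
    # not details from this thread.
    import pika

    def publish_partition(records):
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host="localhost"))
        channel = connection.channel()
        channel.queue_declare(queue="spark_to_es", durable=True)
        for record in records:
            channel.basic_publish(
                exchange="",
                routing_key="spark_to_es",
                body=record,  # already a JSON string
                properties=pika.BasicProperties(delivery_mode=2))  # persist
        connection.close()

    # df is the DataFrame that was being written straight to ES
    df.toJSON().foreachPartition(publish_partition)

A consumer on the other side (for example Logstash's rabbitmq input, linked below) would then push the records into ES at its own pace.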
Some links:
1. ES-RabbitMQ River - https://github.com/elastic/elasticsearch-river-rabbitmq/blob/master/README.md
2. Using RabbitMQ with ELK - https://www.elastic.co/guide/en/logstash/current/plugins-inputs-rabbitmq.html

I have not tried this out myself; I'm just offering what I have come across as a solution to a typical problem.

On Wed, Jan 18, 2017 at 10:07 AM, Koert Kuipers <ko...@tresata.com> wrote:

> In our experience you can't, really. There are some settings to make Spark
> wait longer before retrying when ES is overloaded, but I have never found
> them of much use.
>
> Check out these settings; maybe they are of some help:
> es.batch.size.bytes
> es.batch.size.entries
> es.http.timeout
> es.batch.write.retry.count
> es.batch.write.retry.wait
>
> On Tue, Jan 17, 2017 at 10:13 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>
>> How can I throttle Spark as it writes to Elasticsearch? I have already
>> repartitioned down to one partition in an effort to slow the writes. ES
>> indicates it is being overloaded, and I don't know how to slow things down.
>> This is all on one r4.xlarge EC2 node that runs both Spark (with 25 GB of
>> RAM) and ES.
>>
>> The script: https://github.com/rjurney/Agile_Data_Code_2/blob/master/ch04/pyspark_to_elasticsearch.py
>>
>> The error: https://gist.github.com/rjurney/ec0d6b1ef050e3fbead2314255f4b6fa
>>
>> I asked the question on the Elasticsearch forums, and I thought someone
>> here might know: https://discuss.elastic.co/t/spark-elasticsearch-exception-maybe-es-was-overloaded/71932
>>
>> Thanks!
>> --
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
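
For completeness, here is a rough sketch of how the es.* settings Koert lists would be passed to the elasticsearch-hadoop connector when writing a DataFrame; the node address, index name, and values are illustrative placeholders, not something tested against Russell's setup:

    # Sketch: passing the es-hadoop batch/retry settings from Koert's reply
    # to a DataFrame write. Smaller batches and longer retry waits effectively
    # throttle the writer. All values below are illustrative, not tuned.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "localhost")
       .option("es.batch.size.entries", "100")      # default 1000; smaller = gentler
       .option("es.batch.size.bytes", "100kb")      # default 1mb
       .option("es.batch.write.retry.count", "10")  # default 3
       .option("es.batch.write.retry.wait", "60s")  # default 10s
       .option("es.http.timeout", "5m")             # default 1m
       .mode("append")
       .save("myindex/mytype"))                     # placeholder index/type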