Putting a queue like RabbitMQ between Spark and ES could probably help, by
buffering the Spark output when ES can't keep up.

Some links:

1. ES-RabbitMQ River -
https://github.com/elastic/elasticsearch-river-rabbitmq/blob/master/README.md
2. Using RabbitMQ with ELK -
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-rabbitmq.html

I have not tried this out myself; I'm just offering what I have come across
as a solution to what seems like a common problem.
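
That said, here is a rough, untested sketch of what the Spark side of such a
setup might look like: each partition publishes its records to RabbitMQ with
pika, and something like the Logstash rabbitmq input from the second link
drains the queue into ES. The host, queue name, and records_rdd are all
placeholders:

import json

import pika


def publish_partition(records):
    # Open one connection per partition, inside the function, so nothing
    # unserializable gets shipped from the driver to the executors
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="spark_output", durable=True)
    for record in records:
        channel.basic_publish(
            exchange="",                     # default exchange routes by queue name
            routing_key="spark_output",
            body=json.dumps(record),
            properties=pika.BasicProperties(delivery_mode=2),  # persist messages
        )
    connection.close()

# Example usage, assuming records_rdd is an RDD of plain dicts:
# records_rdd.foreachPartition(publish_partition)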

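Also, regarding the es.* settings Koert lists below: as a rough sketch (again
untested), they can be passed in the conf dict when writing from a PySpark RDD
through the es-hadoop connector. The index name, values, and records_rdd here
are placeholders, not tuned recommendations:

# Smaller batches plus longer retry waits should give ES a chance to catch up
es_conf = {
    "es.resource": "my_index/my_type",       # placeholder index/type
    "es.batch.size.entries": "100",          # flush after fewer docs than the default
    "es.batch.size.bytes": "1mb",
    "es.batch.write.retry.count": "10",      # retry more times...
    "es.batch.write.retry.wait": "60s",      # ...and wait longer between retries
    "es.http.timeout": "2m",
}

# es-hadoop expects an RDD of (key, dict) pairs; the key is ignored
doc_rdd = records_rdd.map(lambda doc: ('ignored_key', doc))
doc_rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)
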
On Wed, Jan 18, 2017 at 10:07 AM, Koert Kuipers <ko...@tresata.com> wrote:

> In our experience you can't, really.
> There are some settings that make Spark wait longer before retrying when ES
> is overloaded, but I have never found them to be of much use.
>
> Check out these settings; maybe they are of some help:
> es.batch.size.bytes
> es.batch.size.entries
> es.http.timeout
> es.batch.write.retry.count
> es.batch.write.retry.wait
>
>
> On Tue, Jan 17, 2017 at 10:13 PM, Russell Jurney <russell.jur...@gmail.com> wrote:
>
>> How can I throttle Spark as it writes to Elasticsearch? I have already
>> repartitioned down to one partition in an effort to slow the writes. ES
>> indicates it is being overloaded, and I don't know how to slow things down.
>> This is all on a single r4.xlarge EC2 node that runs both Spark (with 25GB
>> of RAM) and ES.
>>
>> The script: https://github.com/rjurney/Agile_Data_Code_2/blob/master/ch04/pyspark_to_elasticsearch.py
>>
>> The error: https://gist.github.com/rjurney/ec0d6b1ef050e3fbead2314255f4b6fa
>>
>> I asked the question on the Elasticsearch forums and I thought someone
>> here might know: https://discuss.elastic.co/t/spark-elasticsearch-exception-maybe-es-was-overloaded/71932
>>
>> Thanks!
>> --
>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
>>
>
>
