Also, the backpressure configuration only applies to Spark 1.5 and above. Just making that clear.
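To make the combination concrete, here is a minimal sketch of setting both options on a SparkConf (assumptions: PySpark 1.5+, and the rate-cap value of 10000 is an illustrative placeholder, not a number from this thread — you'd derive your own empirically, as Cody suggests below):

```python
# Sketch only: requires PySpark 1.5+ for backpressure; the app name
# and the 10000 records/sec/partition cap are hypothetical examples.
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("kafka-direct-stream")
        # Spark 1.5+: adaptively throttle ingestion to the processing rate.
        .set("spark.streaming.backpressure.enabled", "true")
        # Hard per-partition cap (records/sec), which also bounds the
        # very first catch-up batches that backpressure alone does not.
        .set("spark.streaming.kafka.maxRatePerPartition", "10000"))
```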
On Fri, Oct 2, 2015 at 6:55 AM, Cody Koeninger <c...@koeninger.org> wrote:

> But turning backpressure on won't stop you from choking on the first batch
> if you're doing e.g. some kind of in-memory aggregate that can't handle
> that many records at once.
>
> On Fri, Oct 2, 2015 at 1:10 AM, Sourabh Chandak <sourabh3...@gmail.com> wrote:
>
>> Thanks Cody, will try to do some estimation.
>>
>> Thanks Nicolae, will try out this config.
>>
>> Thanks,
>> Sourabh
>>
>> On Thu, Oct 1, 2015 at 11:01 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com> wrote:
>>
>>> Hi,
>>>
>>> Set 10ms and spark.streaming.backpressure.enabled=true.
>>>
>>> This should automatically delay the next batch until the current one is
>>> processed, or at least create that balance over a few batches/periods
>>> between the consume/process rate vs. the ingestion rate.
>>>
>>> Nicu
>>>
>>> ------------------------------
>>> *From:* Cody Koeninger <c...@koeninger.org>
>>> *Sent:* Thursday, October 1, 2015 11:46 PM
>>> *To:* Sourabh Chandak
>>> *Cc:* user
>>> *Subject:* Re: spark.streaming.kafka.maxRatePerPartition for direct stream
>>>
>>> That depends on your job, your cluster resources, the number of seconds
>>> per batch...
>>>
>>> You'll need to do some empirical work to figure out how many messages
>>> per batch a given executor can handle. Divide that by the number of
>>> seconds per batch.
>>>
>>> On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak <sourabh3...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am writing a Spark Streaming job using the direct stream method for
>>>> Kafka, and I want to handle the case of checkpoint failure, when we'll
>>>> have to reprocess the entire data from the start. By default, for every
>>>> new checkpoint it tries to load everything from each partition, and that
>>>> takes a lot of time to process. After some searching I found that there
>>>> is a config, spark.streaming.kafka.maxRatePerPartition, which can be used
>>>> to tackle this. My question is: what would be a suitable range for this
>>>> config if we have ~12 million messages in Kafka with a maximum message
>>>> size of ~10 MB?
>>>>
>>>> Thanks,
>>>> Sourabh
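Cody's sizing advice above can be sketched as simple arithmetic. This is a hedged example, not a recommendation: the 1,000,000 records-per-batch budget, 32 partitions, and 10-second batch interval are hypothetical numbers standing in for whatever empirical testing shows; only the ~12 million backlog figure comes from the thread.

```python
# Rough sizing for spark.streaming.kafka.maxRatePerPartition.
# All concrete inputs below are illustrative assumptions, not values
# from the thread; substitute your own measurements.

def max_rate_per_partition(records_per_batch, num_partitions, batch_interval_sec):
    """The setting is expressed in records per second per Kafka
    partition, so divide the per-batch budget by both the partition
    count and the batch length."""
    return records_per_batch // (num_partitions * batch_interval_sec)

# Suppose empirical testing shows the job comfortably handles
# ~1,000,000 records per batch, the topic has 32 partitions, and
# the batch interval is 10 seconds:
rate = max_rate_per_partition(1_000_000, 32, 10)
print(rate)  # 3125

# A ~12 million message backlog would then drain over roughly:
batches_to_drain = 12_000_000 // 1_000_000
print(batches_to_drain)  # 12
```

The point of the hard cap is exactly the failure mode Cody describes: backpressure adapts over a few batches, but only maxRatePerPartition bounds the very first batch after a checkpoint restart.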