But turning backpressure on won't stop you from choking on the first batch if you're doing e.g. some kind of in-memory aggregate that can't handle that many records at once.
On Fri, Oct 2, 2015 at 1:10 AM, Sourabh Chandak <sourabh3...@gmail.com> wrote: > Thanks Cody, will try to do some estimation. > > Thanks Nicolae, will try out this config. > > Thanks, > Sourabh > > On Thu, Oct 1, 2015 at 11:01 PM, Nicolae Marasoiu < > nicolae.maras...@adswizz.com> wrote: > >> Hi, >> >> >> Set 10ms and spark.streaming.backpressure.enabled=true >> >> >> This should automatically delay the next batch until the current one is >> processed, or at least create that balance over a few batches/periods >> between the consume/process rate vs ingestion rate. >> >> >> Nicu >> >> ------------------------------ >> *From:* Cody Koeninger <c...@koeninger.org> >> *Sent:* Thursday, October 1, 2015 11:46 PM >> *To:* Sourabh Chandak >> *Cc:* user >> *Subject:* Re: spark.streaming.kafka.maxRatePerPartition for direct >> stream >> >> That depends on your job, your cluster resources, the number of seconds >> per batch... >> >> You'll need to do some empirical work to figure out how many messages per >> batch a given executor can handle. Divide that by the number of seconds >> per batch. >> >> >> >> On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak <sourabh3...@gmail.com> >> wrote: >> >>> Hi, >>> >>> I am writing a spark streaming job using the direct stream method for >>> kafka and wanted to handle the case of checkpoint failure when we'll have >>> to reprocess the entire data from starting. By default for every new >>> checkpoint it tries to load everything from each partition and that takes a >>> lot of time for processing. After some searching found out that there >>> exists a config spark.streaming.kafka.maxRatePerPartition which can be used >>> to tackle this. My question is what will be a suitable range for this >>> config if we have ~12 million messages in kafka with maximum message size >>> ~10 MB. >>> >>> Thanks, >>> Sourabh >>> >> >> >