Also, the backpressure configuration only applies to Spark 1.5 and above. Just making that clear.
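To make the combination concrete, here is a minimal sketch of setting both options on a SparkConf (assumptions: PySpark 1.5+, and the rate-cap value of 10000 is an illustrative placeholder, not a number from this thread — you'd derive your own empirically, as Cody suggests below):

```python
# Sketch only: requires PySpark 1.5+ for backpressure; the app name
# and the 10000 records/sec/partition cap are hypothetical examples.
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("kafka-direct-stream")
        # Spark 1.5+: adaptively throttle ingestion to the processing rate.
        .set("spark.streaming.backpressure.enabled", "true")
        # Hard per-partition cap (records/sec), which also bounds the
        # very first catch-up batches that backpressure alone does not.
        .set("spark.streaming.kafka.maxRatePerPartition", "10000"))
```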
On Fri, Oct 2, 2015 at 6:55 AM, Cody Koeninger <c...@koeninger.org> wrote:

> But turning backpressure on won't stop you from choking on the first batch
> if you're doing e.g. some kind of in-memory aggregate that can't handle
> that many records at once.
>
> On Fri, Oct 2, 2015 at 1:10 AM, Sourabh Chandak <sourabh3...@gmail.com> wrote:
>
>> Thanks Cody, will try to do some estimation.
>>
>> Thanks Nicolae, will try out this config.
>>
>> Thanks,
>> Sourabh
>>
>> On Thu, Oct 1, 2015 at 11:01 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com> wrote:
>>
>>> Hi,
>>>
>>> Set 10ms and spark.streaming.backpressure.enabled=true.
>>>
>>> This should automatically delay the next batch until the current one is
>>> processed, or at least create that balance over a few batches/periods
>>> between the consume/process rate vs. the ingestion rate.
>>>
>>> Nicu
>>>
>>> ------------------------------
>>> *From:* Cody Koeninger <c...@koeninger.org>
>>> *Sent:* Thursday, October 1, 2015 11:46 PM
>>> *To:* Sourabh Chandak
>>> *Cc:* user
>>> *Subject:* Re: spark.streaming.kafka.maxRatePerPartition for direct stream
>>>
>>> That depends on your job, your cluster resources, the number of seconds
>>> per batch...
>>>
>>> You'll need to do some empirical work to figure out how many messages
>>> per batch a given executor can handle. Divide that by the number of
>>> seconds per batch.
>>>
>>> On Thu, Oct 1, 2015 at 3:39 PM, Sourabh Chandak <sourabh3...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am writing a Spark Streaming job using the direct stream method for
>>>> Kafka, and I want to handle the case of checkpoint failure, when we'll
>>>> have to reprocess the entire data from the start. By default, for every
>>>> new checkpoint it tries to load everything from each partition, and that
>>>> takes a lot of time to process. After some searching I found that there
>>>> is a config, spark.streaming.kafka.maxRatePerPartition, which can be used
>>>> to tackle this. My question is: what would be a suitable range for this
>>>> config if we have ~12 million messages in Kafka with a maximum message
>>>> size of ~10 MB?
>>>>
>>>> Thanks,
>>>> Sourabh
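Cody's sizing advice above can be sketched as simple arithmetic. This is a hedged example, not a recommendation: the 1,000,000 records-per-batch budget, 32 partitions, and 10-second batch interval are hypothetical numbers standing in for whatever empirical testing shows; only the ~12 million backlog figure comes from the thread.

```python
# Rough sizing for spark.streaming.kafka.maxRatePerPartition.
# All concrete inputs below are illustrative assumptions, not values
# from the thread; substitute your own measurements.

def max_rate_per_partition(records_per_batch, num_partitions, batch_interval_sec):
    """The setting is expressed in records per second per Kafka
    partition, so divide the per-batch budget by both the partition
    count and the batch length."""
    return records_per_batch // (num_partitions * batch_interval_sec)

# Suppose empirical testing shows the job comfortably handles
# ~1,000,000 records per batch, the topic has 32 partitions, and
# the batch interval is 10 seconds:
rate = max_rate_per_partition(1_000_000, 32, 10)
print(rate)  # 3125

# A ~12 million message backlog would then drain over roughly:
batches_to_drain = 12_000_000 // 1_000_000
print(batches_to_drain)  # 12
```

The point of the hard cap is exactly the failure mode Cody describes: backpressure adapts over a few batches, but only maxRatePerPartition bounds the very first batch after a checkpoint restart.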