I've been experimenting with my configuration for a couple of days and gained
quite a bit of performance through small optimizations, but it may very well
be something crazy I'm doing that is causing this problem.

To give a little bit of background, I am in the early stages of a project
that consumes a stream of data on the order of 100,000 events per second and
requires processing over a sliding window of one day (ideally one week).
Spark Streaming is a good candidate, but I want to make sure I squash any
performance issues before I commit to it.

With a 5-second batch interval, after 40 minutes the processing time climbs
to 5 seconds as well. I see CPU spikes for about two seconds out of every
five. I assume the sliding window operation is very expensive in this case
and that it is the root cause of this effect.
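The expensive-window suspicion can be illustrated outside of Spark. A naive window recomputes the whole day of data on every slide, while an incremental window only adds the newest batch and subtracts the batch that fell out of the window; Spark Streaming's reduceByKeyAndWindow with an inverse-reduce function applies the same idea. A minimal sketch of the incremental approach in plain Python (not Spark API, just the concept):

```python
from collections import deque

def sliding_sums(batches, window):
    """Incremental sliding-window sum over a sequence of batches.

    On each slide, add the newest batch's total and subtract the
    batch that just fell out of the window, instead of re-summing
    everything inside the window from scratch.
    """
    per_batch = deque()   # totals of the batches currently in the window
    total = 0
    results = []
    for batch in batches:
        s = sum(batch)
        per_batch.append(s)
        total += s
        if len(per_batch) > window:
            total -= per_batch.popleft()  # O(1) instead of re-summing
        results.append(total)
    return results

# Window of 3 batches over some per-batch values:
print(sliding_sums([[1], [2, 3], [4], [5]], window=3))  # [1, 6, 10, 14]
```

With an inverse function each slide costs O(new data) rather than O(window), which matters a lot when the window is a day and the slide is seconds.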

I should've done a little more research before posting; I just came across a
post about an undocumented property, spark.streaming.concurrentJobs, that I
am about to try. I'm still unclear how exactly this works with a sliding
window, where the result of one batch depends on the previous one. I assume
concurrency can only be achieved up to the point where the window operation
is executed. Either way, I am going to give this a try and will post back
here if it doesn't work.
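For reference, the property can be passed at launch like any other Spark conf setting. A hypothetical invocation (the class and jar names are placeholders, and 2 is just an example value for the undocumented property mentioned above):

```shell
# spark.streaming.concurrentJobs is the undocumented property discussed
# above; 2 is an example value, and the class/jar names are placeholders.
spark-submit \
  --class com.example.MyStreamingApp \
  --conf spark.streaming.concurrentJobs=2 \
  my-streaming-app.jar
```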

Thanks!



On Thu, Feb 12, 2015 at 2:55 PM, Arush Kharbanda <ar...@sigmoidanalytics.com
> wrote:

> It could depend on the nature of your application, but Spark Streaming
> uses Spark internally, so concurrency should already be there. What is
> your use case?
>
>
> Are you sure that your configuration is good?
>
>
> On Fri, Feb 13, 2015 at 1:17 AM, Matus Faro <matus.f...@kik.com> wrote:
>
>> Hi,
>>
>> Please correct me if I'm wrong: in Spark Streaming, the next batch will
>> not start processing until the previous batch has completed. Is there
>> any way to start processing the next batch when the previous batch is
>> taking longer to process than the batch interval?
>>
>> The problem I am facing is that I don't see a hardware bottleneck in
>> my Spark cluster, but Spark is not able to handle the amount of data I
>> am pumping through (batch processing time is longer than the batch
>> interval). What I'm seeing is spikes of CPU, network and disk IO usage,
>> which I assume are due to the different stages of a job, but on average
>> the hardware is underutilized. Concurrency in batch processing would
>> allow the average batch processing time to be greater than the batch
>> interval while fully utilizing the hardware.
>>
>> Any ideas on what can be done? One option I can think of is to split
>> the application into multiple applications running concurrently and
>> divide the initial stream of data between them. However, I would lose
>> the benefits of having a single application.
>>
>> Thank you,
>> Matus
>>
>>
>>
>
>
> --
>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>
