I've been experimenting with my configuration for a couple of days and gained quite a bit of performance through small optimizations, but it may very well be something crazy I'm doing that is causing this problem.
To give a little background, I am in the early stages of a project that
consumes a stream of data on the order of 100,000 items per second and
requires processing over a sliding window of one day (ideally a week).
Spark Streaming is a good candidate, but I want to make sure I squash any
performance issues before I commit. With a 5 second batch size, after 40
minutes the processing time also reaches 5 seconds, and I see CPU spike
for about two seconds out of every five. I assume the sliding window
operation is very expensive in this case and is the root cause of this
effect.

I should have done a little more research before I posted; I just came
across a post about an undocumented property, spark.streaming.concurrentJobs,
that I am about to try. I'm still confused about how exactly this works
with a sliding window, where the result of one batch depends on another.
I assume concurrency can only be achieved up until the window action is
executed. Either way, I am going to give this a try and will post back
here if it doesn't work. Thanks!

On Thu, Feb 12, 2015 at 2:55 PM, Arush Kharbanda <ar...@sigmoidanalytics.com>
wrote:

> It could depend on the nature of your application, but Spark Streaming
> uses Spark internally, so concurrency should be there. What is your use
> case?
>
> Are you sure that your configuration is good?
>
> On Fri, Feb 13, 2015 at 1:17 AM, Matus Faro <matus.f...@kik.com> wrote:
>
>> Hi,
>>
>> Please correct me if I'm wrong: in Spark Streaming, the next batch will
>> not start processing until the previous batch has completed. Is there
>> any way to start processing the next batch when the previous batch is
>> taking longer to process than the batch interval?
>>
>> The problem I am facing is that I don't see a hardware bottleneck in
>> my Spark cluster, but Spark is not able to handle the amount of data I
>> am pumping through (batch processing time is longer than the batch
>> interval).
>> What I'm seeing is spikes of CPU, network, and disk IO usage, which I
>> assume are due to the different stages of a job, but on average the
>> hardware is underutilized. Concurrency in batch processing would allow
>> the average batch processing time to exceed the batch interval while
>> fully utilizing the hardware.
>>
>> Any ideas on what can be done? One option I can think of is to split
>> the application into multiple applications running concurrently and
>> divide the initial stream of data between them. However, I would lose
>> the benefits of having a single application.
>>
>> Thank you,
>> Matus
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
> --
>
> [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com>
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
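The expensive sliding window discussed at the top of the thread is usually better handled incrementally: Spark Streaming's reduceByKeyAndWindow has an overload that also takes an inverse reduce function, so each slide only adds the newest batch and subtracts the batch that fell out of the window, instead of re-reducing a whole day of data every 5 seconds. A minimal sketch of that add/subtract idea in plain Python, independent of Spark (the class and names here are illustrative, not part of any Spark API):

```python
from collections import deque

class SlidingWindowSum:
    """Incrementally maintains the sum over the last `window_batches` batches.

    Each new batch costs O(1) work instead of re-reducing the whole
    window, which is the same trick Spark Streaming uses when
    reduceByKeyAndWindow is given an inverse ("subtract") function.
    """

    def __init__(self, window_batches):
        self.window = window_batches
        self.batches = deque()   # per-batch partial sums still inside the window
        self.total = 0           # running sum over the current window

    def add_batch(self, batch_sum):
        # Add the newest batch's contribution...
        self.batches.append(batch_sum)
        self.total += batch_sum
        # ...and subtract any batch that has slid out of the window.
        while len(self.batches) > self.window:
            self.total -= self.batches.popleft()
        return self.total

# A one-day window with 5-second batches would be 17,280 batches;
# a 3-batch window keeps the example small.
w = SlidingWindowSum(window_batches=3)
results = [w.add_batch(s) for s in [10, 20, 30, 40]]
print(results)  # [10, 30, 60, 90] -- the last result drops the first batch
```

In Spark itself this corresponds to the invertible form of reduceByKeyAndWindow (a reduce function plus an inverse function, with window and slide durations), which requires checkpointing to be enabled; an invertible reduce is what makes a day-long window affordable at this event rate.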