If you're writing to S3, want to avoid small files, and don't actually need
3-minute latency, you may want to consider just running a regular Spark
job (using KafkaUtils.createRDD) at scheduled intervals rather than a
streaming job.
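
To make that concrete: a scheduled batch job has to work out for itself
which Kafka offsets each run should read. Below is a rough sketch of just
that bookkeeping, with no Spark dependency. OffsetPlanner, Range, and plan
are hypothetical names, and starting unseen partitions at offset 0 is a
simplifying assumption (a real job would need its own policy, since old
offsets may already have been deleted by retention). In an actual job, each
planned range would presumably be turned into a Spark OffsetRange (via
OffsetRange.create(topic, partition, fromOffset, untilOffset)) and handed to
KafkaUtils.createRDD, and the "until" offsets saved for the next run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: plans which Kafka offsets one scheduled run should read. */
final class OffsetPlanner {

    /** Half-open range [fromOffset, untilOffset) for one topic partition. */
    static final class Range {
        final String topic;
        final int partition;
        final long fromOffset;
        final long untilOffset;

        Range(String topic, int partition, long fromOffset, long untilOffset) {
            this.topic = topic;
            this.partition = partition;
            this.fromOffset = fromOffset;
            this.untilOffset = untilOffset;
        }

        /** Number of messages covered by this range. */
        long count() {
            return untilOffset - fromOffset;
        }
    }

    /**
     * @param committed next offset to read per partition, as saved by the previous run
     * @param latest    current end offset per partition, as reported by Kafka
     * @return one range per partition that has new data, in partition order
     */
    static List<Range> plan(String topic, Map<Integer, Long> committed, Map<Integer, Long> latest) {
        List<Range> ranges = new ArrayList<>();
        // Copy into a TreeMap so the result is ordered by partition id.
        for (Map.Entry<Integer, Long> e : new TreeMap<>(latest).entrySet()) {
            int partition = e.getKey();
            long until = e.getValue();
            // Assumption: a partition with no saved offset is read from 0.
            Long saved = committed.get(partition);
            long from = (saved == null) ? 0L : saved;
            // Skip partitions with nothing new, so no empty ranges are planned.
            if (until > from) {
                ranges.add(new Range(topic, partition, from, until));
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        Map<Integer, Long> committed = new TreeMap<>();
        committed.put(0, 100L);
        committed.put(1, 250L);

        Map<Integer, Long> latest = new TreeMap<>();
        latest.put(0, 400L);
        latest.put(1, 250L); // nothing new on partition 1
        latest.put(2, 30L);  // partition 2 has no saved offset yet

        for (Range r : plan("events", committed, latest)) {
            System.out.println(r.topic + "-" + r.partition
                    + " [" + r.fromOffset + ", " + r.untilOffset + ") count=" + r.count());
        }
    }
}
```

Because each run reads a bounded range, the interval between runs (and so
the size of each output file) can be tuned independently of any streaming
batch interval.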

On Thu, Oct 29, 2015 at 8:16 AM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> If you are writing to S3, also make sure that you are using the direct
> output committer. I don't have streaming jobs but it helps in my machine
> learning jobs. Also, though more partitions help in processing faster, they
> do slow down writes to S3. So you might want to coalesce before writing to
> S3.
>
> Regards
> Sab
> On 29-Oct-2015 6:21 pm, "Afshartous, Nick" <nafshart...@turbine.com>
> wrote:
>
>> < Does it work as expected with smaller batch or smaller load? Could it
>> be that it's accumulating too many events over 3 minutes?
>>
>> Thanks for your input.  The 3 minute window was chosen because we write
>> the output of each batch into S3.  And with smaller batch time intervals
>> there were many small files being written to S3, something to avoid.  That
>> was the explanation of the developer who made this decision (who's no
>> longer on the team).   We're in the process of re-evaluating.
>> --
>>      Nick
>>
>> -----Original Message-----
>> From: Adrian Tanase [mailto:atan...@adobe.com]
>> Sent: Wednesday, October 28, 2015 4:53 PM
>> To: Afshartous, Nick <nafshart...@turbine.com>
>> Cc: user@spark.apache.org
>> Subject: Re: Spark/Kafka Streaming Job Gets Stuck
>>
>> Does it work as expected with smaller batch or smaller load? Could it be
>> that it's accumulating too many events over 3 minutes?
>>
>> You could also try increasing the parallelism via repartition to ensure
>> smaller tasks that can safely fit in working memory.
>>
>> > On 28 Oct 2015, at 17:45, Afshartous, Nick <nafshart...@turbine.com> wrote:
>> >
>> >
>> > Hi, we are load testing our Spark 1.3 streaming (reading from Kafka)
>> > job and seeing a problem.  This is running in AWS/Yarn and the streaming
>> > batch interval is set to 3 minutes and this is a ten node cluster.
>> >
>> > Testing at 30,000 events per second we are seeing the streaming job get
>> > stuck (stack trace below) for over an hour.
>> >
>> > Thanks for any insights or suggestions.
>> > --
>> >      Nick
>> >
>> > org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.mapPartitionsToPair(JavaDStreamLike.scala:43)
>> > com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.runStream(StreamingKafkaConsumerDriver.java:125)
>> > com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.main(StreamingKafkaConsumerDriver.java:71)
>> > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > java.lang.reflect.Method.invoke(Method.java:606)
>> > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)
>> >
>> > Notice: This communication is for the intended recipient(s) only and
>> may contain confidential, proprietary, legally protected or privileged
>> information of Turbine, Inc. If you are not the intended recipient(s),
>> please notify the sender at once and delete this communication.
>> Unauthorized use of the information in this communication is strictly
>> prohibited and may be unlawful. For those recipients under contract with
>> Turbine, Inc., the information in this communication is subject to the
>> terms and conditions of any applicable contracts or agreements.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
>> > additional commands, e-mail: user-h...@spark.apache.org
>> >