If you're writing to S3, want to avoid small files, and don't actually need 3-minute latency... you may want to consider just running a regular Spark job (using KafkaUtils.createRDD) at scheduled intervals rather than a streaming job.
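A rough sketch of that approach (not tested; it assumes the Kafka 0.8 direct API from spark-streaming-kafka, and the broker list, topic name, offsets, and S3 path are placeholders; persisting the last-consumed offsets between runs is left out):

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object ScheduledKafkaBatch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("scheduled-kafka-to-s3"))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // One range per topic partition: read everything produced since the last run.
    // How you persist the last-consumed offsets (ZooKeeper, a DB, S3, ...) is up
    // to you; the literal offsets here are placeholders.
    val offsetRanges = Array(
      OffsetRange("events", 0, 1000000L, 1100000L),
      OffsetRange("events", 1, 1000000L, 1100000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    // Coalesce so each scheduled run writes a handful of larger objects to S3
    // instead of one small file per Kafka partition.
    rdd.map(_._2)
      .coalesce(4)
      .saveAsTextFile("s3n://my-bucket/events/run-2015-10-29T13-00")

    sc.stop()
  }
}

Kick it off from cron (or whatever scheduler you already use) at an interval that keeps the output files a reasonable size.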
On Thu, Oct 29, 2015 at 8:16 AM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:

> If you are writing to S3, also make sure that you are using the direct
> output committer. I don't have streaming jobs but it helps in my machine
> learning jobs. Also, though more partitions help in processing faster, they
> do slow down writes to S3. So you might want to coalesce before writing to
> S3.
>
> Regards
> Sab
>
> On 29-Oct-2015 6:21 pm, "Afshartous, Nick" <nafshart...@turbine.com> wrote:
>
>> < Does it work as expected with smaller batch or smaller load? Could it
>> be that it's accumulating too many events over 3 minutes?
>>
>> Thanks for your input. The 3-minute window was chosen because we write
>> the output of each batch into S3, and with smaller batch intervals
>> many small files were being written to S3, something to avoid. That
>> was the explanation of the developer who made this decision (who's no
>> longer on the team). We're in the process of re-evaluating.
>> --
>> Nick
>>
>> -----Original Message-----
>> From: Adrian Tanase [mailto:atan...@adobe.com]
>> Sent: Wednesday, October 28, 2015 4:53 PM
>> To: Afshartous, Nick <nafshart...@turbine.com>
>> Cc: user@spark.apache.org
>> Subject: Re: Spark/Kafka Streaming Job Gets Stuck
>>
>> Does it work as expected with smaller batch or smaller load? Could it be
>> that it's accumulating too many events over 3 minutes?
>>
>> You could also try increasing the parallelism via repartition to ensure
>> smaller tasks that can safely fit in working memory.
>>
>> Sent from my iPhone
>>
>> > On 28 Oct 2015, at 17:45, Afshartous, Nick <nafshart...@turbine.com> wrote:
>> >
>> > Hi, we are load testing our Spark 1.3 streaming (reading from Kafka)
>> > job and seeing a problem. This is running on AWS/YARN with the
>> > streaming batch interval set to 3 minutes on a ten-node cluster.
>> >
>> > Testing at 30,000 events per second, we are seeing the streaming job
>> > get stuck (stack trace below) for over an hour.
>> >
>> > Thanks for any insights or suggestions.
>> > --
>> > Nick
>> >
>> > org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.mapPartitionsToPair(JavaDStreamLike.scala:43)
>> > com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.runStream(StreamingKafkaConsumerDriver.java:125)
>> > com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.main(StreamingKafkaConsumerDriver.java:71)
>> > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> > java.lang.reflect.Method.invoke(Method.java:606)
>> > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)
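For the streaming job itself, here is a minimal sketch of the repartition-then-coalesce advice quoted above (again untested; it assumes the Kafka 0.8 direct stream, the partition counts, broker, topic, and S3 path are placeholders, and the per-record map is a stand-in for the real work). The direct output committer Sabarish mentions requires a committer implementation on the classpath and is not shown here:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToS3Streaming {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-to-s3"), Minutes(3))

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))

    stream
      .repartition(40)                                // smaller tasks while processing
      .map { case (_, value) => value.toUpperCase }   // stand-in for the real per-record work
      .foreachRDD { rdd =>
        // Shrink back down just before the write so each 3-minute batch lands in
        // S3 as a few larger objects rather than many small ones.
        rdd.coalesce(4)
          .saveAsTextFile(s"s3n://my-bucket/events/batch-${System.currentTimeMillis()}")
      }

    ssc.start()
    ssc.awaitTermination()
  }
}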