Re: Spark streaming on standalone cluster

Borja Garrido Bear Wed, 01 Jul 2015 02:25:54 -0700

Hi all,

Thanks for the answers, yes, my problem was I was using just one worker
with one core, so it was starving and then I never get the job to run, now
it seems it's working properly.


One question, is this information in the docs? (because maybe I misread it)

On Wed, Jul 1, 2015 at 10:30 AM, <[email protected]> wrote:

>  Spark streaming needs at least two threads on the worker/slave side. I
> have seen this issue when(to test the behavior), I set the thread count for
> spark streaming to 1. It should be atleast 2: one for the receiver
> adapter(kafka, flume etc) and the second for processing the data.
>
>
>
> But I tested that in local mode: “--master local[2] “. The same issue
> could happen in worker also.  If you set “--master local[1] “ the streaming
> worker/slave blocks due to starvation.
>
>
>
> Which conf parameter sets the worker thread count in cluster mode ? is it
> spark.akka.threads ?
>
>
>
> *From:* Tathagata Das [mailto:[email protected]]
> *Sent:* 01 July 2015 01:32
> *To:* Borja Garrido Bear
> *Cc:* user
> *Subject:* Re: Spark streaming on standalone cluster
>
>
>
> How many receivers do you have in the streaming program? You have to have
> more numbers of core in reserver by your spar application than the number
> of receivers. That would explain the receiving output after stopping.
>
>
>
> TD
>
>
>
> On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear <[email protected]>
> wrote:
>
>  Hi all,
>
>
>
> I'm running a spark standalone cluster with one master and one slave
> (different machines and both in version 1.4.0), the thing is I have a spark
> streaming job that gets data from Kafka, and the just prints it.
>
>
>
> To configure the cluster I just started the master and then the slaves
> pointing to it, as everything appears in the web interface I assumed
> everything was fine, but maybe I missed some configuration.
>
>
>
> When I run it locally there is no problem, it works.
>
> When I run it in the cluster the worker state appears as "loading"
>
>  - If the job is a Scala one, when I stop it I receive all the output
>
>  - If the job is Python, when I stop it I receive a bunch of these
> exceptions
>
>
>
>
> \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
>
>
>
> ERROR JobScheduler: Error running job streaming job 1435675420000 ms.0
>
> py4j.Py4JException: An exception was raised by the Python Proxy. Return
> Message: null
>
> at py4j.Protocol.getReturnValue(Protocol.java:417)
>
> at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)
>
> at com.sun.proxy.$Proxy14.call(Unknown Source)
>
> at
> org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)
>
> at
> org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
>
> at
> org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
>
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>
> at scala.util.Try$.apply(Try.scala:161)
>
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
>
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
>
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
>
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
>
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
> \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
>
>
>
> Is there any known issue with spark streaming and the standalone mode? or
> with Python?
>
>
>  The information contained in this electronic message and any attachments
> to this message are intended for the exclusive use of the addressee(s) and
> may contain proprietary, confidential or privileged information. If you are
> not the intended recipient, you should not disseminate, distribute or copy
> this e-mail. Please notify the sender immediately and destroy all copies of
> this message and any attachments. WARNING: Computer viruses can be
> transmitted via email. The recipient should check this email and any
> attachments for the presence of viruses. The company accepts no liability
> for any damage caused by any virus transmitted by this email.
> www.wipro.com
>

Re: Spark streaming on standalone cluster

Reply via email to