Spark streaming routing

2016-01-07 Thread Lin Zhao
I need to route a DStream through the streaming pipeline by some key, such that data with the same key always goes through the same executor. There doesn't seem to be a way to do manual routing with Spark Streaming. The closest I can come up with is: stream.foreachRDD {rdd => rdd.grou
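
A minimal sketch of that approach, assuming a DStream[(String, String)] named stream and a hypothetical process() function; groupByKey shuffles all values for a key into a single partition, so one task handles each key within a batch:

    stream.foreachRDD { rdd =>
      rdd.groupByKey().foreachPartition { partition =>
        // all values for a given key land in exactly one partition,
        // so a single task sees all of that key's data for this batch
        partition.foreach { case (key, values) => values.foreach(process) }
      }
    }

Note this only co-locates a key within a batch; as the reply below points out, there is no guarantee a key stays on the same executor across batches.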

"impossible to get artifacts " error when using sbt to build 1.6.0 for scala 2.11

2016-01-07 Thread Lin Zhao
I tried to build 1.6.0 for YARN and Scala 2.11, but hit an error. Any help is appreciated. [warn] Strategy 'first' was applied to 2 files [info] Assembly up to date: /Users/lin/git/spark/network/yarn/target/scala-2.11/spark-network-yarn-1.6.0-hadoop2.7.1.jar java.lang.IllegalStateException:

Re: Spark streaming routing

2016-01-07 Thread Lin Zhao
a low cpu:memory ratio. From: Tathagata Das <t...@databricks.com> Date: Thursday, January 7, 2016 at 1:56 PM To: Lin Zhao <l...@exabeam.com> Cc: user <user@spark.apache.org> Subject: Re: Spark streaming routing You cannot guarantee that each key will forever be on

yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Lin Zhao
My job runs fine in yarn cluster mode but I have reason to use client mode instead. But I'm hitting this error when submitting: spark-submit --class com.exabeam.martini.scripts.SparkStreamingTest --master yarn --deploy-mode client --executor-memory 90G --num-executors 3 --executor-cores 1

Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Lin Zhao
and > things are failing because of that. > On Wed, Jan 13, 2016 at 9:31 AM, Lin Zhao wrote: >> My job runs fine in yarn cluster mode but I have reason to use client mode instead. But I'm hitting this error when submitting: >>> spark-submit --class com.exa

Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Lin Zhao
Hi, I'm testing Spark Streaming with an actor receiver. The actor keeps calling store() to save a pair to Spark. Once the job is launched, everything looks good on the UI. Millions of events get through every batch. However, I added logging to the first step and found that only 20 or 40 events i
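
For context, a minimal sketch of an actor receiver like the one described, assuming Spark 1.6's akka-based ActorHelper trait (removed in later releases); the actor and message shape here are hypothetical:

    import akka.actor.{Actor, Props}
    import org.apache.spark.streaming.receiver.ActorHelper

    class FeederActor extends Actor with ActorHelper {
      def receive = {
        case (key: String, value: String) =>
          store((key, value))  // hand the pair to Spark's receiver pipeline
      }
    }

    // wiring, given an existing StreamingContext ssc:
    // val pairs = ssc.actorStream[(String, String)](Props[FeederActor], "feeder")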

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Lin Zhao
BlockManager: Removing RDD 41 16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 40 16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 39 From: "Shixiong(Ryan) Zhu" <shixi...@databricks.com> Date: Thursday, January 14, 2016 at 4:13 PM To: Lin Zhao

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Lin Zhao
Date: Thursday, January 14, 2016 at 4:41 PM To: Lin Zhao <l...@exabeam.com> Cc: user <user@spark.apache.org> Subject: Re: Spark Streaming: custom actor receiver losing vast majority of data Could you post the code of MessageRetriever? And by the way, could you post t

Spark Streaming: routing by key without groupByKey

2016-01-15 Thread Lin Zhao
I have a requirement to route a paired DStream through a series of map and flatMap operations such that entries with the same key go to the same thread within the same batch. The closest I can come up with is groupByKey().flatMap(_._2), but this kills throughput by 50%. When I think about it, groupByKey is more tha
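
One commonly suggested alternative (a hedged sketch, not taken from this thread's replies): partition by key without materializing per-key groups, so records for a key still flow through one task per batch but as a plain iterator rather than buffered Iterables. Assumes a DStream[(String, String)] named pairs:

    import org.apache.spark.HashPartitioner

    // hash each key to a fixed partition; no per-key groups are built
    val routed = pairs.transform(_.partitionBy(new HashPartitioner(16)))

    val results = routed.mapPartitions { iter =>
      // within a batch, every record for a key passes through this task
      iter.map { case (key, value) => (key, value.length) }
    }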

Spark Streaming: Does mapWithState implicitly partition the dstream?

2016-01-17 Thread Lin Zhao
When the state is passed to the task that handles mapWithState for a particular key, if data for that key is spread across partitions, it seems extremely difficult to coordinate and synchronise the state. Is there a partition by key before a mapWithState? If not, what exactly is the execution model? Thanks, Lin
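
For reference, the shape of the Spark 1.6 API (a minimal sketch, assuming pairs is a DStream[(String, Int)]; the running-sum logic is illustrative). mapWithState does partition both the state and each incoming batch by key, using the StateSpec's partitioner (a HashPartitioner by default), so a key's state is read and updated by a single task:

    import org.apache.spark.streaming.{State, StateSpec}

    val spec = StateSpec.function(
      (key: String, value: Option[Int], state: State[Int]) => {
        val sum = state.getOption.getOrElse(0) + value.getOrElse(0)
        state.update(sum)  // this key's state is owned by one partition
        (key, sum)
      })

    val running = pairs.mapWithState(spec)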

Re: Spark Streaming: BatchDuration and Processing time

2016-01-22 Thread Lin Zhao
Hi Silvio, Can you go into a little detail on how the back pressure works? Does it block the receiver? Or does it temporarily save the incoming messages in memory/disk? I have a custom actor receiver that uses store() to save data to Spark. Would the back pressure make store() calls block? On 1/17/16,

Streaming: mapWithState "Error during Java deserialization."

2016-01-26 Thread Lin Zhao
I'm using mapWithState and hit https://issues.apache.org/jira/browse/SPARK-12591. Since 1.6.1 is not yet released, I tried the workaround in the comment, but hit these errors on one of the nodes. While millions of events go through the mapWithState, only 7 show up in the log. Is this related to

Spark streaming flow control and back pressure

2016-01-27 Thread Lin Zhao
I have an actor receiver that reads data and calls "store()" to save data to Spark. I was hoping spark.streaming.receiver.maxRate and spark.streaming.backpressure would help block the method when needed, to avoid overflowing the pipeline. But they don't. My actor pumps millions of lines to sp
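
For reference, the two settings named above as SparkConf entries (a sketch; the values are illustrative, and note the full flag name is spark.streaming.backpressure.enabled):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // upper bound, in records per second, per receiver
      .set("spark.streaming.receiver.maxRate", "10000")
      // let Spark adapt the ingest rate from batch scheduling feedback
      .set("spark.streaming.backpressure.enabled", "true")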

Re: Spark streaming flow control and back pressure

2016-01-27 Thread Lin Zhao
One solution is to read the scheduling delay and have my actor go to sleep if needed. Is this possible? From: Lin Zhao <l...@exabeam.com> Date: Wednesday, January 27, 2016 at 5:28 PM To: "user@spark.apache.org"
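
A hedged sketch of that idea: a StreamingListener can observe each batch's scheduling delay. One caveat not raised in the thread: the listener runs in the driver while an actor receiver runs on an executor, so the delay would still have to be delivered to the actor somehow (the mechanism is left open here):

    import org.apache.spark.streaming.scheduler.{
      StreamingListener, StreamingListenerBatchCompleted}

    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(
          batch: StreamingListenerBatchCompleted): Unit = {
        // grows when batches arrive faster than they are processed
        val delayMs = batch.batchInfo.schedulingDelay.getOrElse(0L)
        if (delayMs > 30000L) {
          // signal the feeding actor to back off
        }
      }
    })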

Re: Spark streaming flow control and back pressure

2016-01-28 Thread Lin Zhao
"Stored ${storeCount.get()} messages to spark}") } From: Iulian Dragoș <iulian.dra...@typesafe.com> Date: Thursday, January 28, 2016 at 5:33 AM To: Lin Zhao <l...@exabeam.com> Cc: "user@spark.apache.org"

Streaming: LeaseExpiredException when writing checkpoint

2016-01-28 Thread Lin Zhao
I'm seeing this error in the driver when running a streaming job. Not sure if it's critical. It happens maybe half the time a checkpoint is saved. There are retries in the log, but sometimes it results in "Could not write checkpoint for time 145400632 ms to file hdfs://ip-172-31-35-122.us-west-2.

Yarn client mode: Setting environment variables

2016-02-17 Thread Lin Zhao
I've been trying to set some environment variables for the Spark executors but haven't had much luck. I tried editing conf/spark-env.sh but it doesn't get through to the executors. I'm running 1.6.0 on YARN; any pointers are appreciated. Thanks, Lin

Re: Yarn client mode: Setting environment variables

2016-02-18 Thread Lin Zhao
Thanks for the reply. I also found that sparkConf.setExecutorEnv works for YARN. From: Saisai Shao <sai.sai.s...@gmail.com> Date: Wednesday, February 17, 2016 at 6:02 PM To: Lin Zhao <l...@exabeam.com> Cc: "user@spark.apache.org"
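
A minimal sketch of that fix (the variable name is hypothetical); setExecutorEnv stores the value as a spark.executorEnv.* property, which YARN applies to each executor's container environment:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setExecutorEnv("MY_ENV_VAR", "some-value")
    // equivalent property form, also usable in spark-defaults.conf:
    // spark.executorEnv.MY_ENV_VAR some-value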

Re: "Failed to bind to" error with spark-shell on CDH5 and YARN

2015-10-25 Thread Lin Zhao
) then it works. From: Steve Loughran Sent: Saturday, October 24, 2015 5:15 AM To: Lin Zhao Cc: user@spark.apache.org Subject: Re: "Failed to bind to" error with spark-shell on CDH5 and YARN On 24 Oct 2015, at 00:46, Lin Zhao <l...@exabeam.com> wrote: I have a spark