I need to route the DStream through the streaming pipeline by some key,
such that data with the same key always goes through the same executor.
There doesn't seem to be a way to do manual routing with Spark Streaming. The
closest I can come up with is:
stream.foreachRDD { rdd =>
  rdd.grou
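Fleshed out, something along these lines is what I mean (a rough sketch only; keyOf and handle are stand-ins for the real key extractor and per-record work, and this only groups within a batch, not across batches):

// Rough sketch of the grouping idea; keyOf and handle are hypothetical placeholders.
// Note: this keeps a key's records together within one batch only; it does not pin
// a key to the same executor across batches.
stream.foreachRDD { rdd =>
  rdd.groupBy(record => keyOf(record))          // all records with the same key end up together
     .foreach { case (key, records) =>          // one task handles each key's records in this batch
       records.foreach(record => handle(key, record))
     }
}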
I tried to build 1.6.0 for YARN and Scala 2.11, but hit an error. Any help is
appreciated.
[warn] Strategy 'first' was applied to 2 files
[info] Assembly up to date:
/Users/lin/git/spark/network/yarn/target/scala-2.11/spark-network-yarn-1.6.0-hadoop2.7.1.jar
java.lang.IllegalStateException:
a low cpu:memory ratio.
From: Tathagata Das <t...@databricks.com>
Date: Thursday, January 7, 2016 at 1:56 PM
To: Lin Zhao <l...@exabeam.com>
Cc: user <user@spark.apache.org>
Subject: Re: Spark streaming routing
You cannot guarantee that each key will forever be on
My job runs fine in yarn cluster mode, but I have reason to use client mode
instead. I'm hitting this error when submitting:
> spark-submit --class com.exabeam.martini.scripts.SparkStreamingTest --master
> yarn --deploy-mode client --executor-memory 90G --num-executors 3
> --executor-cores 1
and
> things are failing because of that.
>
> On Wed, Jan 13, 2016 at 9:31 AM, Lin Zhao wrote:
>> My job runs fine in yarn cluster mode but I have reason to use client mode
>> instead. But I'm hitting this error when submitting:
>>> spark-submit --class com.exa
Hi,
I'm testing Spark Streaming with an actor receiver. The actor keeps calling
store() to save a pair to Spark.
Once the job is launched, everything looks good on the UI: millions of events
get through every batch. However, I added logging to the first step and found
that only 20 or 40 events i
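The receiver is essentially this shape (a simplified sketch, not the actual MessageRetriever):

import akka.actor.{Actor, Props}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.ActorHelper

// Simplified actor receiver on Spark 1.6: every message the actor receives is
// handed to Spark through store(). Names here are placeholders.
class PairReceiver extends Actor with ActorHelper {
  def receive = {
    case pair: (String, String) @unchecked => store(pair)
  }
}

// Hypothetical wiring, assuming ssc is an existing StreamingContext:
// val pairs = ssc.actorStream[(String, String)](Props[PairReceiver], "pair-receiver",
//   StorageLevel.MEMORY_AND_DISK_SER)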
lockManager: Removing RDD 41
16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 40
16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 39
From: "Shixiong(Ryan) Zhu"
mailto:shixi...@databricks.com>>
Date: Thursday, January 14, 2016 at 4:13 PM
To: Lin Zhao mailt
Date: Thursday, January 14, 2016 at 4:41 PM
To: Lin Zhao <l...@exabeam.com>
Cc: user <user@spark.apache.org>
Subject: Re: Spark Streaming: custom actor receiver losing vast majority of data
Could you post the code of MessageRetriever? And by the way, could you post
t
I have a requirement to route a paired DStream through a series of map and
flatMap operations such that entries with the same key go to the same thread
within the same batch. The closest I can come up with is
groupByKey().flatMap(_._2), but this kills throughput by 50%.
When I think about it, groupByKey is more tha
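One alternative I've been sketching (rough only; process stands in for the real map/flatMap work) is to hash-partition each batch and run the logic per partition:

import org.apache.spark.HashPartitioner

// Hash-partition each batch so entries with the same key land in the same partition
// (and hence the same task for that batch), then run the per-key logic inside
// mapPartitions. `process` is a placeholder for the real per-entry work.
val routed = stream.transform { rdd =>
  rdd.partitionBy(new HashPartitioner(rdd.sparkContext.defaultParallelism))
}.mapPartitions { iter =>
  iter.map { case (key, value) => process(key, value) }
}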
When the state is passed to the task that handles mapWithState for a
particular key, if records for that key are distributed across executors, it
seems extremely difficult to coordinate and synchronise the state. Is there a
partition by key before mapWithState? If not, what exactly is the execution model?
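For reference, this is the kind of usage I have in mind (a minimal sketch only, assuming a DStream[(String, Int)] named pairs; StateSpec does seem to accept a partitioner for the state, which is what prompted the question):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{State, StateSpec}

// Minimal mapWithState sketch. The partitioner controls which partition holds the
// state for each key; the 32 below is just an example partition count.
def sumByKey(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val total = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(total)
  (key, total)
}

val spec = StateSpec.function(sumByKey _).partitioner(new HashPartitioner(32))
val stateful = pairs.mapWithState(spec)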
Thanks,
Lin
Hi Silvio,
Can you go into a little detail about how the back pressure works? Does it
block the receiver? Or does it temporarily save the incoming messages in
memory/disk? I have a custom actor receiver that uses store() to save data
to Spark. Would the back pressure make the store() call block?
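For context, these are the settings I mean (a minimal sketch; the values are examples only):

import org.apache.spark.SparkConf

// spark.streaming.backpressure.enabled adjusts the ingest rate from batch-processing
// feedback; spark.streaming.receiver.maxRate caps records per second per receiver.
val conf = new SparkConf()
  .setAppName("streaming-backpressure-example")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.receiver.maxRate", "10000")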
On 1/17/16,
I'm using mapWithState and hit
https://issues.apache.org/jira/browse/SPARK-12591. Since 1.6.1 is not released yet,
I tried the workaround in the comments, but I got these errors on one of the
nodes. While millions of events go through mapWithState, only 7 show up in
the log. Is this related to
I have an actor receiver that reads data and calls "store()" to save data to
Spark. I was hoping spark.streaming.receiver.maxRate and
spark.streaming.backpressure would help me block the method when needed to
avoid overflowing the pipeline, but they don't. My actor pumps millions of
lines to sp
One solution would be to read the scheduling delay so my actor can go to sleep
when needed. Is this possible?
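Something along these lines is what I have in mind (a rough sketch; the class name and the 5-second threshold are made up):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// A listener that records the latest scheduling delay so the receiver actor can
// back off when the delay grows.
class DelayListener extends StreamingListener {
  val lastSchedulingDelayMs = new AtomicLong(0L)
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    batch.batchInfo.schedulingDelay.foreach(lastSchedulingDelayMs.set)
  }
}

// Hypothetical wiring, with ssc the StreamingContext and the actor checking the value:
// val delays = new DelayListener
// ssc.addStreamingListener(delays)
// if (delays.lastSchedulingDelayMs.get() > 5000) Thread.sleep(1000)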
From: Lin Zhao <l...@exabeam.com>
Date: Wednesday, January 27, 2016 at 5:28 PM
To: "user@spark.apache.org" <user@spark.apache.org>
"Stored ${storeCount.get()} messages to spark")
}
From: Iulian Dragoș <iulian.dra...@typesafe.com>
Date: Thursday, January 28, 2016 at 5:33 AM
To: Lin Zhao <l...@exabeam.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
I'm seeing this error in the driver when running a streaming job. Not sure if
it's critical.
It happens maybe half the time a checkpoint is saved. There are retries in the
log, but sometimes it results in "Could not write checkpoint for time 145400632 ms
to file
hdfs://ip-172-31-35-122.us-west-2.
I've been trying to set some environment variables for the Spark executors but
haven't had much luck. I tried editing conf/spark-env.sh but it doesn't get
through to the executors. I'm running 1.6.0 on YARN; any pointer is
appreciated.
Thanks,
Lin
Thanks for the reply. I also found that sparkConf.setExecutorEnv works for
yarn.
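For the archives, a minimal sketch of that approach (MY_VAR is just an example name):

import org.apache.spark.SparkConf

// setExecutorEnv("MY_VAR", ...) ends up as spark.executorEnv.MY_VAR in the conf,
// which YARN applies to the executor containers' environment.
val conf = new SparkConf()
  .setAppName("executor-env-example")
  .setExecutorEnv("MY_VAR", "some-value")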
From: Saisai Shao <sai.sai.s...@gmail.com>
Date: Wednesday, February 17, 2016 at 6:02 PM
To: Lin Zhao <l...@exabeam.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
) then it works.
From: Steve Loughran
Sent: Saturday, October 24, 2015 5:15 AM
To: Lin Zhao
Cc: user@spark.apache.org
Subject: Re: "Failed to bind to" error with spark-shell on CDH5 and YARN
On 24 Oct 2015, at 00:46, Lin Zhao <l...@exabeam.com> wrote:
I have a spark