I need to implement the following logic in a Spark Streaming app:
for the incoming DStream, do some transformation, then invoke
updateStateByKey to update the state object for each key (marking data
entries that were updated as dirty for the next step), then let the state
objects produce event(s) based on the dirty entries.
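A minimal sketch of that pattern, assuming a made-up SensorState case class with a dirty flag and a toEvents helper (the socket source, field names, and event format are placeholders of mine, not from the original post):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical state object: keeps the latest value plus a dirty flag.
case class SensorState(lastValue: Double, dirty: Boolean) {
  def toEvents: Seq[String] =
    if (dirty) Seq(s"value changed to $lastValue") else Seq.empty
}

val ssc = new StreamingContext(
  new SparkConf().setAppName("DirtyStateSketch").setMaster("local[2]"), Seconds(1))
ssc.checkpoint("/tmp/dirty-state-checkpoint") // required by updateStateByKey

// (sensorId, measurement) pairs; the real source is application-specific.
val keyedDStream = ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(id, v) = line.split(","); (id, v.toDouble) }

// Mark the state dirty whenever new data arrived for the key in this batch.
val stateDStream = keyedDStream.updateStateByKey[SensorState] {
  (newValues: Seq[Double], state: Option[SensorState]) =>
    if (newValues.isEmpty) state.map(_.copy(dirty = false)) // nothing new this batch
    else Some(SensorState(newValues.last, dirty = true))
}

// Next step: let the state objects produce events based on the dirty flag.
val events = stateDStream.flatMap { case (_, s) => s.toEvents }
events.print()

ssc.start()
ssc.awaitTermination()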
Here's the official Spark documentation about batch size/interval:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#setting-the-right-batch-size
Spark is batch-oriented processing. As you mentioned, a stream is a
continuous flow, which core Spark cannot handle directly. Spark Streaming
bridges that gap by dividing the continuous stream into small batches
(micro-batches) that core Spark can then process.
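For example, with a 2-second batch interval (an arbitrary value here), every 2 seconds of input becomes one small batch that is processed as a regular Spark job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchExample").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2)) // batch interval = 2s

// Each 2-second window of input becomes one batch (one RDD) to process.
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

ssc.start()
ssc.awaitTermination()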
What are you trying to do? Generate time series from your data in HDFS, or
do some transformation and/or aggregation on your time series data in HDFS?
If you want the result as an RDD of (key, 1):
val newRdd = rdd.filter(x => x._2 == 1)
If you want the result as an RDD of keys (since you know the values are 1), then:
val newRdd = rdd.filter(x => x._2 == 1).map(x => x._1)
x._1 and x._2 are the Scala way to access the key and value of a
key/value pair.
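A self-contained way to try both variants (the sample data is made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("FilterOnes").setMaster("local[*]"))

val rdd = sc.parallelize(Seq(("a", 1), ("b", 0), ("c", 1)))

val pairsRdd = rdd.filter(x => x._2 == 1)            // RDD[(String, Int)]: (a,1), (c,1)
val keysRdd  = rdd.filter(x => x._2 == 1).map(_._1)  // RDD[String]: a, c

pairsRdd.collect().foreach(println)
keysRdd.collect().foreach(println)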
I'm not sure what you mean by "previous run". Is it the previous batch, or
a previous run of spark-submit?
If it's the previous batch (Spark Streaming creates a batch every batch
interval), then there's nothing to do.
If it's a previous run of spark-submit (assuming you are able to save the
result somewhere), ...
There's no need to initialize StateDStream. Take a look at the example
StatefulNetworkWordCount.scala; it's part of the Spark source code.
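A trimmed-down version of the pattern in that example (this is a paraphrase, not the exact file; the socket source and interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]"), Seconds(1))
ssc.checkpoint("/tmp/wordcount-checkpoint") // updateStateByKey needs a checkpoint dir

// Running count per word; the state simply starts from None on the first
// batch, so there is no explicit StateDStream initialization anywhere.
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val stateDStream = words.map(w => (w, 1)).updateStateByKey[Int](updateFunc)
stateDStream.print()

ssc.start()
ssc.awaitTermination()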
Follow the instructions here:
http://spark.apache.org/docs/latest/building-with-maven.html
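If memory serves, the basic command on that page is along the lines of mvn -DskipTests clean package, run after raising MAVEN_OPTS as the page describes; check the page itself for the exact profile flags for your Hadoop/YARN version.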
Thanks for your response! I found that too, and it does the trick! Here's
the refined code:
val inputDStream = ...
val keyedDStream = inputDStream.map(...) // use sensorId as key
val partitionedDStream = keyedDStream.transform(rdd =>
  rdd.partitionBy(new MyPartitioner(...)))
val stateDStream = partitionedDStream.updateStateByKey[...](updateFunc)
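Filling in the elided pieces with placeholders of my own (MyPartitioner's hashing logic, the socket source, and updateFunc below are illustrative, not from the original thread), the full shape is roughly:

import org.apache.spark.{Partitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative partitioner: pins each sensorId to a fixed partition so its
// state stays on the same partition across batches.
class MyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    (key.hashCode & Integer.MAX_VALUE) % numPartitions
}

val ssc = new StreamingContext(
  new SparkConf().setAppName("PartitionedState").setMaster("local[2]"), Seconds(1))
ssc.checkpoint("/tmp/state-checkpoint") // required by updateStateByKey

// Stand-in source: "sensorId,value" lines over a socket.
val keyedDStream = ssc.socketTextStream("localhost", 9999)
  .map { line => val Array(id, v) = line.split(","); (id, v.toDouble) }

val partitionedDStream = keyedDStream.transform(rdd =>
  rdd.partitionBy(new MyPartitioner(8)))

// Illustrative update function: keep the latest value per sensor.
val updateFunc = (newValues: Seq[Double], state: Option[Double]) =>
  if (newValues.isEmpty) state else Some(newValues.last)

// updateStateByKey also has an overload that takes a Partitioner directly,
// which keeps the state RDD partitioned the same way.
val stateDStream = partitionedDStream.updateStateByKey[Double](
  updateFunc, new MyPartitioner(8))
stateDStream.print()

ssc.start()
ssc.awaitTermination()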
Add something like the following to spark-env.sh:
export LD_LIBRARY_PATH=<path to your native libraries>:$LD_LIBRARY_PATH
(and remove all 5 exports you listed). Then restart all worker nodes and
try again.
Good luck!
Hi Mayur,
Thanks for your response. I did write a simple test that sets up a DStream
with 5 batches. The batch duration is 1 second, and the 3rd batch takes an
extra 2 seconds. The output of the test shows that the 3rd batch causes a
backlog, and Spark Streaming does catch up on the 4th and 5th batches (...
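A sketch of that kind of test using queueStream to feed a fixed number of batches, with an artificial delay in the 3rd one (the 1-second interval and 2-second delay mirror the numbers above; the rest is my own scaffolding):

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("SlowBatchTest").setMaster("local[2]"), Seconds(1))

// 5 pre-built batches, one dequeued per 1-second batch interval.
val queue = mutable.Queue[RDD[Int]]()
(1 to 5).foreach { i => queue += ssc.sparkContext.makeRDD(Seq(i)) }
val stream = ssc.queueStream(queue, oneAtATime = true)

stream.foreachRDD { (rdd, time) =>
  val batchId = rdd.collect().headOption.getOrElse(-1)
  if (batchId == 3) Thread.sleep(2000) // make the 3rd batch take ~2s extra
  println(s"batch $batchId finished at $time")
}

ssc.start()
ssc.awaitTerminationOrTimeout(15000)
ssc.stop()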
Thanks for your response. I do have something like:
val inputDStream = ...
val keyedDStream = inputDStream.map(...) // use sensorId as key
val partitionedDStream = keyedDStream.transform(rdd =>
  rdd.partitionBy(new MyPartitioner(...)))
val stateDStream = partitionedDStream.updateStateByKey[...](updateFunc)
I'm working on a DStream application. The input is sensors' measurements;
the data format is ...
There are 10 thousand sensors, and updateStateByKey is used to maintain
the state of each sensor. The code looks like the following:
val inputDStream = ...
val keyedDStream = inputDStream.map(...) // use sensorId as key
Since you are using your home computer, it's probably not reachable by EC2
from the internet.
You can try setting "spark.driver.host" to your WAN IP and "spark.driver.port"
to a fixed port in SparkConf, and open that port in your home network (port
forwarding to the computer you are using). See if that helps.
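A minimal sketch of those two settings (the IP, port, and master host below are placeholders):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RemoteDriverExample")
  .setMaster("spark://<master-host>:7077")
  .set("spark.driver.host", "203.0.113.10") // placeholder: your WAN IP
  .set("spark.driver.port", "51000")        // placeholder: the port you open/forward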
The command should be "spark-shell --master spark://<master-host>:7077".
... factor or failures),
then what's the right batch interval? 5 seconds (the worst case)?
2. What will happen to DStream processing if one batch takes longer than the
batch interval? Can Spark recover from that?
Thanks,
Qihong