Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread DandyDev
Hi all!

When reading about Spark Streaming and its execution model, I see diagrams
like this a lot:

[image: DStream diagram showing a stream of events broken into RDD micro
batches over the axis of time (time 0 to 1, time 1 to 2, etc.)]

It does a fine job of explaining how a DStream consists of micro batches that
are basically RDDs. There are, however, some things I don't understand:

- RDDs are distributed by design, but micro batches are conceptually small.
How/why are these micro batches distributed so that they need to be
implemented as RDDs?
- The above image doesn't explain how Spark Streaming parallelizes data.
According to the image, a stream of events gets broken into micro batches
over the axis of time (time 0 to 1 is a micro batch, time 1 to 2 is a micro
batch, etc.). How does parallelism come into play here? Is it that even
within a "time slot" (e.g. time 0 to 1) there can be so many events that
multiple micro batches for that time slot will be created and distributed
across the executors? (The sketch after this list shows the setup I have in
mind.)
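
To make the question concrete, here is a minimal sketch of the kind of app I
have in mind (socketTextStream and the 1-second interval are just placeholder
choices, not my real setup):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamBatches {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-batches").setMaster("local[2]")
    // One micro batch is produced per 1-second interval
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)

    // Each interval yields exactly one RDD; the RDD's partitions are what
    // get distributed across the executors
    lines.foreachRDD { rdd =>
      println(s"batch RDD with ${rdd.getNumPartitions} partitions")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}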

Clarification would be helpful!

Daan






Can mapWithState state func be called every batchInterval?

2016-10-11 Thread DandyDev
Hi there,

I've built a Spark Streaming app that accepts certain events from Kafka, and
I want to keep some state between the events. So I've successfully used
mapWithState for that. The problem is that I want the state for keys to be
updated on every batchInterval, because the "lack" of events is also
significant to the use case. This doesn't seem possible with mapWithState,
unless I'm missing something.
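
Here is a simplified version of what I have now (EventCount and the counting
logic are stand-ins for my real state; events is the pair DStream I get from
Kafka):

import org.apache.spark.streaming.{Seconds, State, StateSpec}

case class EventCount(count: Long)  // stand-in for my real state type

// As far as I can tell, this is only invoked for keys that have new events
// in the current batch (plus keys whose timeout fires), not for idle keys
def trackState(key: String, event: Option[String],
               state: State[EventCount]): (String, EventCount) = {
  if (state.isTimingOut()) {
    (key, state.get())  // state can't be updated once it's timing out
  } else {
    val count = state.getOption().map(_.count).getOrElse(0L) +
      (if (event.isDefined) 1L else 0L)
    state.update(EventCount(count))
    (key, EventCount(count))
  }
}

val spec = StateSpec.function(trackState _).timeout(Seconds(60))
// events is a DStream[(String, String)] coming from Kafka
val stateStream = events.mapWithState(spec)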

Previously I looked at updateStateByKey, which says:
> In every batch, Spark will apply the state update function for all
> existing keys, regardless of whether they have new data in a batch or not.

That is what I want. However, I've seen several tutorials/blog posts
advising not to use updateStateByKey anymore, and to use mapWithState
instead.
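
For reference, this is the updateStateByKey shape the quote above describes
(a minimal sketch; counting events per key stands in for my real logic):

def updateFunc(newEvents: Seq[String], state: Option[Long]): Option[Long] = {
  // Invoked for every existing key in every batch, with an empty Seq when
  // the key had no new events -- exactly the behavior I need
  Some(state.getOrElse(0L) + newEvents.size)
}

ssc.checkpoint("/tmp/checkpoint")  // placeholder path; required for stateful ops
val counts = events.updateStateByKey(updateFunc _)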

So my questions:

- Can the mapWithState state function be called every batchInterval, even
when no events exist for that interval?
- If not, is it okay to use updateStateByKey instead? Or will it be
deprecated in the near future?
- If mapWithState doesn't support my need, is there another way to update
state every batchInterval that still uses mapWithState, combined with some
other mechanism? (The sketch below shows the only workaround I've come up
with so far.)
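
The workaround I have in mind, completely untested, is sketched below:
trackedKeys is a made-up placeholder for the keys I care about, and the idea
is to union each batch with one synthetic event per tracked key:

// Hypothetical: the keys whose state should be refreshed every batch
val trackedKeys = ssc.sparkContext.parallelize(Seq("key-a", "key-b"))

// Add a synthetic empty event for every tracked key to every batch, so
// mapWithState invokes the state function for them even on quiet intervals
// (the state function would then have to treat "" as a no-op event)
val ticked = events.transform { rdd =>
  rdd.union(trackedKeys.map(key => (key, "")))
}
val stateStream = ticked.mapWithState(spec)  // spec as in the sketch above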

Thanks in advance!






Reusing HBase connection in transformations

2017-02-14 Thread DandyDev
Hi!

I'm struggling with the following problem: I have a couple of Spark
Streaming jobs that keep state (using mapWithState, and in one case
updateStateByKey) and write their results to HBase. One of the streaming
jobs needs the results that the other streaming job writes to HBase.
As currently implemented, the state function reads data from HBase and uses
it in its calculations. The drawback is that every time the state function
is called, a new connection to HBase is opened.
The Spark Streaming guide suggests reusing the same connection within one
partition, but that advice applies only to /actions/ (i.e. foreachRDD). How
would you do it for transformations (like mapWithState)? The sketch below
shows the direction I've been considering.
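
For context, this is what I've been considering (a sketch, not tested; it
assumes the HBase 1.x client API, and the table name is a placeholder): hold
the connection in a singleton object, so each executor JVM creates it lazily
once and every task on that executor reuses it.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// One HBase connection per executor JVM: the lazy val is initialized on
// first use inside a task and then reused by all later tasks on that executor
object HBaseConnectionHolder {
  lazy val connection: Connection = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    sys.addShutdownHook(conn.close())
    conn
  }
}

// Inside the state function, which runs on the executors:
//   val table = HBaseConnectionHolder.connection.getTable(TableName.valueOf("results"))  // placeholder table
//   ... table.get(...) ...
//   table.close()  // Table objects are cheap; close the Table, keep the Connection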


