Very interesting ideas, Andy!

Conceptually, I think it makes sense. It is true that, when dealing with
time series data, windowing over application time and windowing over a
number of events are things that DStream does not natively support. The
real challenge is mapping these conceptual windows onto the underlying
RDD model. One aspect you correctly observed is the ordering of events
within the RDDs of the DStream. Another fundamental aspect is that RDDs
are parallel collections, with no well-defined ordering of the records
within an RDD. If you want to process the records in an RDD as an ordered
stream of events, you pretty much have to process the stream sequentially,
which means processing each RDD partition one by one, and therefore losing
the parallelism. So implementing all this functionality may mean adding
features at the cost of performance. Whether it is okay for Spark Streaming
to have these features, or whether this tradeoff is too unintuitive for
end users and therefore should not come out-of-the-box with Spark Streaming
-- that is definitely a question worth debating.
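
For illustration, here is a minimal sketch of what that sequential
processing forces you into (rdd and handleEvent are hypothetical
placeholders, assuming each event carries a timestamp field):

    // rdd: RDD[Event] and handleEvent are hypothetical placeholders
    rdd.keyBy(_.timestamp)   // key each event by its application time
      .sortByKey()           // global sort by time: a shuffle
      .values
      .toLocalIterator       // pulls one partition at a time to the driver
      .foreach(handleEvent)  // strictly sequential -- parallelism is gone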

That said, some limited use cases, like windowing over N events, can be
implemented using custom RDDs like SlidingRDD
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala>
without losing parallelism. For things like application-time-based windows
and arbitrary application-event-based windows, it's much harder.
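
For example, a count-based sliding window can be expressed today roughly
like this (a sketch, assuming an active SparkContext sc and that the
sliding() helper in mllib's RDDFunctions is usable from user code):

    import org.apache.spark.mllib.rdd.RDDFunctions._  // adds sliding() to RDDs

    val values = sc.parallelize(1 to 100).map(_.toDouble)
    // Each window holds 3 consecutive elements in partition order; windows
    // that straddle partition boundaries are stitched together internally
    // by SlidingRDD, so the computation stays parallel.
    val movingAvg = values.sliding(3).map(w => w.sum / w.length)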

Interesting ideas nonetheless. I am curious to see how far we can push
this using the RDD model underneath, without losing parallelism and
performance.

TD



On Tue, Jul 15, 2014 at 10:11 AM, andy petrella <andy.petre...@gmail.com>
wrote:

> Dear Sparkers,
>
> *[sorry for the lengthy email... => head to the gist
> <https://gist.github.com/andypetrella/12228eb24eea6b3e1389> for a preview
> :-p]*
>
> I would like to share some thinking I had about a use case I faced.
> Basically, as the subject announces, it's a generalization of the
> DStream currently available in the streaming project.
> First of all, I'd like to say that it's only the result of some personal
> thinking, alone in the dark with a use case, the Spark code, a sheet of
> paper and a poor pen.
>
>
> DStream is a great concept for dealing with micro-batching use cases, and
> it does it very well too!
> It also relies heavily on elapsed time to create its internal
> micro-batches.
> However, there are similar use cases that need micro-batches but where
> this constraint on time doesn't hold; here are two of them:
> * a micro-batch has to be created every *n* events received
> * a micro-batch has to be generated based on the values of the items
> pushed by the source (which might not even be a stream!).
>
> An example of a use case (mine ^^) would be:
> * the creation of timeseries from a cold source containing timestamped
> events (like S3)
> * each of these timeseries has cells that are the mean (or sum, count,
> ...) of one of the fields of the events
> * the mean has to be computed over a window depending on a *timestamp*
> field
> * a timeseries is created for each type of event (the number of types is
> high)
> So, in this case, it'd be interesting to have an RDD for each cell, which
> would generate all cells for all needed timeseries.
> It's more or less what DStream does, but here it won't help, due to what
> was stated above.
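>
> To make the target concrete, with plain RDDs (no DStream) a first cut of
> that computation could look like this (Event and windowMs are hypothetical
> names, not from the gist):
>
>     import org.apache.spark.SparkContext._  // pair-RDD implicits
>     import org.apache.spark.rdd.RDD
>
>     case class Event(eventType: String, timestamp: Long, value: Double)
>
>     val windowMs = 60 * 1000L  // one cell per minute, per event type
>     def cells(events: RDD[Event]): RDD[((String, Long), Double)] =
>       events
>         .map(e => ((e.eventType, e.timestamp / windowMs), (e.value, 1L)))
>         .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
>         .mapValues { case (sum, count) => sum / count }  // mean per cell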
>
> That's how I came to a rough sketch of what could be named ContinuousRDD
> (CRDD), which is basically an RDD[RDD[_]]. And, for the sake of
> simplicity, I've stuck with the definition of a DStream to think about it.
> Okay, let's go ^^.
>
>
> Looking at the DStream contract, here is something that could be drafted
> around CRDD.
> A *CRDD* would be a generalized concept that relies on:
> * a reference space/continuum (to which data can be bound)
> * a binning function that can break the continuum into splits.
> Since *Space* is a continuum, we could define it with:
> * a *SpacePoint* (the origin)
> * a SpacePoint => SpacePoint (the continuous function)
> * an Ordering[SpacePoint]
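>
> As a rough type sketch (just the shape of the idea; the gist below has
> the actual skeleton):
>
>     import org.apache.spark.rdd.RDD
>
>     // the continuum: an origin, a successor function, and an ordering
>     trait Space[P] {
>       def origin: P
>       def next: P => P
>       def ordering: Ordering[P]
>     }
>
>     // a CRDD binds records to the continuum and bins them into splits,
>     // each split materializing as one RDD (hence RDD[RDD[_]] conceptually)
>     trait CRDD[P, T] {
>       def space: Space[P]
>       def bin(records: RDD[(P, T)]): Seq[RDD[T]]
>     }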
>
> DStream uses a *JobGenerator* along with a DStreamGraph, which use a
> timer and a clock to do their work; in the case of a CRDD we'll also have
> to define a point generator, as a more generic but still adaptable
> concept.
>
>
> So far (so good?), these definitions should work quite well for an
> *ordered* space
> for which:
> * points are coming/fetched in order
> * the space is fully filled (no gaps)
> For these cases, the JobGenerator (f.i.) could be defined with two extra
> functions:
> * one responsible for chopping a batch even if the upper bound of the
> batch hasn't been seen yet
> * the other responsible for handling outliers (which could be wrapped
> into yet another CRDD?)
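>
> As hypothetical signatures (names invented here, not in the gist), the
> point generator with those two extras could look like:
>
>     trait PointGenerator[P] {
>       // emit the next split boundary, even if no record at or beyond
>       // it has been observed yet (time-outs, watermarks, ...)
>       def chop(lastSeen: P): P
>       // decide what to do with a record falling outside the open split
>       // (drop it? feed it into yet another CRDD?)
>       def outlier[T](point: P, record: T): Unit
>     }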
>
>
> I created a gist wrapping up the types, and thus the skeleton of this
> idea; you can find it here:
> https://gist.github.com/andypetrella/12228eb24eea6b3e1389
>
> WDYT?
> *The answer can be: you're a fool!*
> Actually, I already am, but I also like to know why... so some
> explanations will help me :-D.
>
> Thanks for reading 'till this point.
>
> Greetz,
>
>
>
>  aℕdy ℙetrella
> about.me/noootsab
>
