Disclaimer - I am not very closely involved with Structured Streaming
design / development, so this is just my two cents from looking at the
discussion in the linked JIRAs and PRs.

It seems to me there are a couple of issues being conflated here: (a)
the question of how to specify or add more functionality to the Sink
API, such as the ability to get model updates back to the driver [a
design issue IMHO], and (b) the question of how to pass parameters to
DataStreamWriter, esp. strings vs. typed objects, and whether the API
is stable vs. experimental.

TLDR is that I think we should first focus on refactoring the Sink and
add new functionality after that. Detailed comments below.

Sink design / functionality: Looking at SPARK-10815, a JIRA linked
from SPARK-16407, it looks like the existing Sink API is limited
because it is tied to the RDD/DataFrame definitions. It also has
surprising limitations, like not being able to run arbitrary
operators on `data` and only supporting `collect`/`foreach`. Given
these limitations, I think it makes sense to redesign the Sink API
first, *before* adding new functionality to the existing Sink. I
understand that we did not mark this as experimental in 2.0.0, but
since Structured Streaming is new as a whole, we can probably still
break the Sink API in an upcoming 2.1.0 release.
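
For reference, the current Sink trait is roughly just the following
(the `data` DataFrame it receives is the handle with the surprising
restrictions mentioned above):

    // org.apache.spark.sql.execution.streaming.Sink, roughly as of 2.0
    trait Sink {
      // Invoked on the driver once per micro-batch; only collect-style
      // operations are reliably supported on `data`.
      def addBatch(batchId: Long, data: DataFrame): Unit
    }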

As a part of the redesign, I think we need to do two things: (i) come
up with a new data handle that separates the RDD from what is passed
to the Sink, and (ii) have some way to specify code that can run on
the driver. The second might not be an issue if the data handle
already has a clean abstraction for this.
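
To make (i) and (ii) concrete, here is a very rough sketch -- all of
these names are made up, and none of this exists today:

    import org.apache.spark.sql.Row

    // Hypothetical data handle decoupled from RDD/DataFrame, so the
    // Sink contract can survive changes to the execution model.
    trait StreamData {
      def foreachPartition(f: Iterator[Row] => Unit): Unit
    }

    // Hypothetical redesigned sink with an explicit driver-side hook,
    // e.g. for shipping model updates back to the driver.
    trait SinkV2 {
      def process(data: StreamData): Unit    // runs as part of a job
      def onDriver(summary: Seq[Row]): Unit  // runs on the driver
    }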

Micro-batching: Ideally it would be good not to expose the
micro-batch processing model in the Sink API, as this might change
going forward. Given the consistency model we are presenting, I think
there will be some notion of a batch / time-range identifier in the
API. But if we can avoid hard constraints on where functions will run
(i.e. on the driver vs. as part of a job, etc.) and when they will
run (i.e. strictly after every micro-batch), it might give us more
freedom to improve performance going forward [1].
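
Purely as a hypothetical illustration of that flexibility, reusing
the made-up StreamData handle from the sketch above:

    // Hypothetical: identify data by a time range rather than a
    // micro-batch id, and make no promise about where this callback
    // executes or that it fires exactly once per micro-batch.
    trait EpochSink {
      def commit(startTime: Long, endTime: Long, data: StreamData): Unit
    }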

Parameter passing: I think your point that typed is better than
untyped is a good one, and supporting both APIs isn't necessarily bad
either. My understanding of the discussion around this is that we
should do this after the Sink is refactored, to avoid exposing the
old APIs?
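
For concreteness, the difference as I understand it (`df` is assumed
to be a streaming DataFrame; the instance-passing form is only the
SPARK-16407 proposal, not an existing API, and MySinkProvider /
updateModel are made-up names):

    // Today: the sink is named by a string and configured with
    // strings; the provider class is instantiated via reflection.
    df.writeStream
      .format("com.example.MySinkProvider")
      .option("path", "/tmp/out")
      .start()

    // Proposed (hypothetical signature): pass a configured instance,
    // so options can be typed values such as lambdas.
    df.writeStream
      .format(new MySinkProvider(row => updateModel(row)))
      .start()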

Thanks
Shivaram

[1] FWIW this is something I am looking at and
https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/
has some details about this.


On Mon, Sep 26, 2016 at 1:38 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> Hi Spark Developers,
>
>
> After some discussion on SPARK-16407 (and on the PR) we’ve decided to jump
> back to the developer list (SPARK-16407 itself comes from our early work on
> SPARK-16424 to enable ML with the new Structured Streaming API). SPARK-16407
> proposes extending the current DataStreamWriter API to allow users to
> specify a specific instance of a StreamSinkProvider - this makes it easier
> for users to create sinks that are configured with things besides strings
> (for example, lambdas). An example of something like this already inside
> Spark is the ForeachSink.
>
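>
> For example, the ForeachSink can already be configured with code today,
> but only through the special-cased foreach() method - roughly as follows,
> where ds is assumed to be a streaming Dataset[String]:
>
>     import org.apache.spark.sql.ForeachWriter
>
>     val query = ds.writeStream.foreach(new ForeachWriter[String] {
>       def open(partitionId: Long, version: Long): Boolean = true
>       def process(value: String): Unit = println(value)  // user code
>       def close(errorOrNull: Throwable): Unit = ()
>     }).start()
>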
>
> We have been working on adding support for online learning in Structured
> Streaming, similar to what Spark Streaming and MLLib provide today. Details
> are available in SPARK-16424. Along the way, we noticed that there is
> currently no way for code running on the driver to access the streaming
> output of a Structured Streaming query (in our case ideally as a Dataset or
> RDD - but regardless of the underlying data structure). In our specific
> case, we wanted to update a model in the driver using aggregates computed by
> a Structured Streaming query.
>
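>
> A rough sketch of what we would like to be able to write - here
> StreamingModel and its update() method are made-up names, and Sink is
> currently an internal API:
>
>     import org.apache.spark.sql.DataFrame
>     import org.apache.spark.sql.execution.streaming.Sink
>
>     class ModelUpdateSink(model: StreamingModel) extends Sink {
>       override def addBatch(batchId: Long, data: DataFrame): Unit = {
>         // addBatch runs on the driver, so the collected aggregates
>         // can be folded directly into the driver-side model.
>         model.update(data.collect())
>       }
>     }
>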
>
> A lot of other applications are going to have similar requirements. For
> example, there is no way (outside of using private Spark internals)* to
> implement a console sink with a user-supplied formatting function, to
> configure a templated or generic sink at runtime, to trigger a custom
> Python callback, or even to implement the ForeachSink outside of Spark.
> For work inside of Spark to enable Structured Streaming with ML we
> clearly don’t need SPARK-16407, since we can directly access the
> internals (although it would be cleaner not to have to). But if we want
> to empower people working outside the Spark codebase with Structured
> Streaming, I think we need to provide some mechanism for this, and it
> would be great to see what options/ideas the community can come up with.
>
>
> One of the arguments against SPARK-16407 seems to be mostly that it exposes
> the Sink API, which is implemented using micro-batching; the counterargument
> is that the Sink API is already exposed (instead of passing in an instance,
> the user passes in a class name which is then created through reflection,
> with configuration parameters passed in as a map of strings).
>
>
> Personally I think we should expose a more nicely typed API instead of
> depending on strings for all configuration. Even if the Sink API itself
> needs to change if/when Structured Streaming moves away from
> micro-batching, we would still likely want to allow users to provide the
> typed interface as well, to give sink creators more flexibility with
> configuration.
>
>
> Now obviously this is based on my understanding of the lay of the land,
> which could be a little off since the Spark Structured Streaming design
> docs and JIRAs don’t seem to be actively updated - so I’d love to know
> what assumptions I’ve made that don’t match the current plans for
> structured streaming.
>
>
> Cheers,
>
>
> Holden :)
>
>
> Related Links:
>
> The JIRA for this proposal https://issues.apache.org/jira/browse/SPARK-16407
>
> The Structured Streaming ML JIRA
> https://issues.apache.org/jira/browse/SPARK-16424
>
> https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing
>
> https://github.com/apache/spark/pull/14691
>
> https://github.com/holdenk/spark-structured-streaming-ml
>
>
> *Strictly speaking one _could_ pass in a string of Java code and then
> compile it inside the Sink with Janino - but that clearly isn’t reasonable.
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
