This is not too broadly worded, and in general I would caution that any
interface in org.apache.spark.sql.catalyst or
org.apache.spark.sql.execution is considered internal and likely to change
between releases.  We do plan to open up a stable source/sink API in a
future release.

The problem here is that the DataFrame is constructed using an
incrementalized physical query plan.  If you call any operations on the
DataFrame that change the logical plan, you will lose prior state and the
DataFrame will return an incorrect result.  Since this was discovered late
in the release process, we decided it was better to document the current
behavior rather than do a large refactoring.
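
To make the distinction concrete, here is a minimal sketch of a custom
Sink written against the internal trait in
org.apache.spark.sql.execution.streaming as it stands in 2.0 (the class
name LoggingSink is purely illustrative):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    class LoggingSink extends Sink {
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        // Safe: consume `data` as-is; this executes the incrementalized
        // physical plan that was handed to us.
        data.collect().foreach(row => println(s"batch $batchId: $row"))

        // Unsafe: anything that rebuilds the logical plan, e.g.
        // data.select(...) or data.toJSON, re-plans the query without
        // the accumulated streaming state and returns an incorrect
        // result.
      }
    }

Consuming in a blocking fashion does not help here; the issue is not
when the work runs, but that a new logical plan discards the
incrementalized one.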

On Tue, Jun 28, 2016 at 12:59 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Looking at the Sink in 2.0 there is a warning (added in SPARK-16020
> without a lot of details) that says "Note: You cannot apply any operators
> on `data` except consuming it (e.g., `collect/foreach`)." but I'm wondering
> if this restriction is perhaps too broadly worded? Provided that we consume
> the data in a blocking fashion could we apply some other transformation
> beforehand? Or is there a better way to get equivalent foreachRDD
> functionality with the structured streaming API?
>
> On somewhat of a tangent - would it maybe make sense to mark transformations
> on Datasets which are not supported for streaming use (e.g., toJSON)?
>
> Cheers,
>
> Holden :)
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
