OK, that makes sense (the JIRA where the restriction note was added didn't
have a lot of details). So for now, would converting to an RDD inside of a
custom Sink and then doing your operations on that be a reasonable
workaround?

On Tuesday, June 28, 2016, Michael Armbrust <mich...@databricks.com> wrote:

> This is not too broadly worded, and in general I would caution that any
> interface in org.apache.spark.sql.catalyst or
> org.apache.spark.sql.execution is considered internal and likely to change
> between releases.  We do plan to open a stable source/sink API in a
> future release.
>
> The problem here is that the DataFrame is constructed using an
> incrementalized physical query plan.  If you call any operations on the
> DataFrame that change the logical plan, you will lose prior state and the
> DataFrame will return an incorrect result.  Since this was discovered late
> in the release process we decided it was better to document the current
> behavior, rather than do a large refactoring.
>
> On Tue, Jun 28, 2016 at 12:59 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Looking at the Sink in 2.0 there is a warning (added in SPARK-16020
>> without a lot of details) that says "Note: You cannot apply any operators
>> on `data` except consuming it (e.g., `collect/foreach`)." but I'm wondering
>> if this restriction is perhaps too broadly worded? Provided that we consume
>> the data in a blocking fashion, could we apply some other transformation
>> beforehand? Or is there a better way to get equivalent foreachRDD
>> functionality with the structured streaming API?
>>
>> On somewhat of a tangent - would it maybe make sense to mark
>> transformations on Datasets which are not supported for Streaming use (e.g.
>> toJson etc.)?
>>
>> Cheers,
>>
>> Holden :)
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
