Ok, that makes sense (the JIRA where the restriction note was added didn't have a lot of details). So for now, would converting to an RDD inside of a custom Sink and then doing your operations on that be a reasonable workaround?
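Roughly something like this sketch, just to make the question concrete (the class and body are illustrative only, and whether calling `.rdd` on `data` counts as simply consuming it is part of what I'm asking):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Illustrative only: convert the incoming DataFrame to an RDD inside
// addBatch and do all further work at the RDD level, so no Dataset
// operators are applied to `data` itself.
class RddWorkaroundSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // `data.rdd` triggers execution of the incremental plan; from here
    // on, everything is plain (blocking) RDD code.
    data.rdd.foreachPartition { rows =>
      rows.foreach { row =>
        println(row) // placeholder for per-record processing
      }
    }
  }
}

(Wiring this up through a StreamSinkProvider is omitted for brevity.)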
On Tuesday, June 28, 2016, Michael Armbrust <mich...@databricks.com> wrote:

> This is not too broadly worded, and in general I would caution that any
> interface in org.apache.spark.sql.catalyst or
> org.apache.spark.sql.execution is considered internal and likely to change
> between releases. We do plan to open a stable source/sink API in a future
> release.
>
> The problem here is that the DataFrame is constructed using an
> incrementalized physical query plan. If you call any operations on the
> DataFrame that change the logical plan, you will lose prior state and the
> DataFrame will return an incorrect result. Since this was discovered late
> in the release process, we decided it was better to document the current
> behavior rather than do a large refactoring.
>
> On Tue, Jun 28, 2016 at 12:59 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Looking at the Sink in 2.0 there is a warning (added in SPARK-16020
>> without a lot of details) that says "Note: You cannot apply any operators
>> on `data` except consuming it (e.g., `collect/foreach`)." but I'm
>> wondering if this restriction is perhaps too broadly worded. Provided
>> that we consume the data in a blocking fashion, could we apply some other
>> transformation beforehand? Or is there a better way to get equivalent
>> foreachRDD functionality with the structured streaming API?
>>
>> On somewhat of a tangent - would it maybe make sense to mark
>> transformations on Datasets which are not supported for streaming use
>> (e.g. toJson etc.)?
>>
>> Cheers,
>>
>> Holden :)
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau

--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau