Re: [SS] Allowing stream Sink metadata as part of checkpoint?

2019-02-25 Thread Arun Mahadevan
Unless it's some sink metadata to be maintained by the framework (e.g. sink state that needs to be passed back to the sink), would it make sense to keep it under the checkpoint dir? Maybe I am missing the motivation of the proposed approach, but I guess the sink mostly needs to store the last se

Re: Thoughts on dataframe cogroup?

2019-02-25 Thread chris
Just to add to this, I’ve also implemented my own cogroup previously and would welcome a cogroup for dataframe. My specific use case was that I had a large amount of time series data. Spark has very limited support for time series (specifically as-of joins), but pandas has good support. My solut

[SS] Allowing stream Sink metadata as part of checkpoint?

2019-02-25 Thread Jungtaek Lim
Hi devs, I was about to give it a try, but it would relate to DSv2, so I decided to start a new thread before doing the actual work. I also don't think this should be part of the DSv2 discussion, since the change would be minor. While dealing with SPARK-24295 [1] and SPARK-26411 [2], I feel the need of partici

Re: Thoughts on dataframe cogroup?

2019-02-25 Thread Jonathan Winandy
For info, in our team we have defined our own cogroup on dataframe in the past on different projects using different methods (rdd[row] based or union all collect list based). I might be biased, but I find the approach very useful in projects to simplify and speed up transformations, and remove a lot of

SPARK-25299 Updates Feb 2019

2019-02-25 Thread Yifei Huang (PD)
Hi everyone, last year we started working on using other types of storage for greater reliability during Spark shuffle, as tracked by this JIRA ticket. We’ve compiled a document outlining the progress we’ve made since our last update in December. Specifically, this includes: A summary of the work