Re: Feature request: split dataset based on condition

2019-02-03 Thread Sean Owen
I don't think Spark supports this model, where N inputs that depend on a parent are computed once, at the same time. You can, of course, cache the parent and filter N times, doing the same amount of work. One problem is, where would the N inputs live? They'd have to be stored if not used immediately, and
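A minimal sketch of the cache-then-filter-N-times approach described above (the dataset, column name, and predicates are illustrative, not from the thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("split-by-filter").getOrCreate()
    import spark.implicits._

    // Illustrative parent dataset with a single "value" column.
    val parent = spark.range(0, 1000).toDF("value")
    parent.cache() // the parent is computed once and reused by each filter

    // Each "split" is a separate filter over the cached parent; Spark
    // still runs N jobs, but each one reads the parent from cache.
    val evens = parent.filter($"value" % 2 === 0)
    val odds  = parent.filter($"value" % 2 =!= 0)

    evens.count()
    odds.count()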

Re: Feature request: split dataset based on condition

2019-02-03 Thread Maciej Szymkiewicz
If the goal is to split the output, then `DataFrameWriter.partitionBy` should do what you need, and no additional methods are required. If not, you can also check Silex's muxPartitions implementation (see https://stackoverflow.com/a/37956034), but its applications are rather limited due to high res
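A short sketch of the `DataFrameWriter.partitionBy` route (column names and output path are illustrative): writing with partitionBy produces one subdirectory per distinct value of the partition column, which effectively splits the output on disk.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("split-by-partitionBy").getOrCreate()
    import spark.implicits._

    // Illustrative dataset with a "category" column to split on.
    val df = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "category")

    // Produces one directory per category value under the output path,
    // e.g. /tmp/out/category=a/ and /tmp/out/category=b/
    df.write
      .partitionBy("category")
      .parquet("/tmp/out") // illustrative output path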

GSoC 2019 : Contributing to Apache Spark

2019-02-03 Thread Vishal Gupta
Hi, I'm a Python developer (and data scientist), and I contributed to Debian[1][2] last year as part of Google Summer of Code[3]. Having used Lucene, Kafka, and Spark in the past, I wanted to work on at least one of them this summer. Since Spark, unlike the others, has a Python API[4], I felt I could g

[DISCUSS] SPIP: Identifiers for multi-catalog Spark

2019-02-03 Thread Ryan Blue
Hi everyone, This is a follow-up to the "Identifiers with multi-catalog support" discussion thread. I've taken the proposal I posted to that thread and written it up as an official SPIP for how to identify tables and other catalog objects when working with multiple catalogs. The doc is available
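For readers skimming the archive: the SPIP concerns multi-part identifiers of the form catalog.namespace.table. A rough, hypothetical illustration of the kind of references it covers (catalog, namespace, and table names are invented; the actual semantics are defined in the linked SPIP doc):

    // Hypothetical: referring to a table through an explicit catalog,
    // as discussed in the SPIP; all names here are invented.
    spark.sql("SELECT * FROM testcat.ns1.table1")

    // Equivalently, via the DataFrame API with a multi-part name.
    spark.table("testcat.ns1.table1")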