Re: Avoiding unnecessary sort in FileFormatWriter/DynamicPartitionDataWriter

2020-09-04 Thread Cheng Su
Hi, Just for context - I created the JIRA for this around 2 years ago (https://issues.apache.org/jira/browse/SPARK-26164, with a stale, unmerged PR at https://github.com/apache/spark/pull/23163). I recently discussed it with Wenchen again, and it looks like it might be reasonable to: 1. Open mu

Re: Avoiding unnecessary sort in FileFormatWriter/DynamicPartitionDataWriter

2020-09-04 Thread Reynold Xin
The issue is memory overhead. Writing files creates a lot of buffers (especially in columnar formats like Parquet/ORC). Even a few file handles and buffers per task can easily OOM the entire process. On Fri, Sep 04, 2020 at 5:51 AM, XIMO GUANTER GONZALBEZ < joaquin.guantergonzal...@telefonica.co

Avoiding unnecessary sort in FileFormatWriter/DynamicPartitionDataWriter

2020-09-04 Thread XIMO GUANTER GONZALBEZ
Hello, I have observed that if a DataFrame is saved with partitioning columns in Parquet, then a sort is performed in FileFormatWriter (see https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L152) because Dynamic

Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-09-04 Thread Etienne Chauchot
Hi Jungtaek Lim, Nice to hear from you again since last time we talked :) and congrats on becoming a Spark committer in the meantime! (If I'm not mistaken, you were not one at the time.) I totally agree with what you're saying on merging structural parts of Spark without having a broader consens

Re: SPIP: Catalog API for view metadata

2020-09-04 Thread John Zhuge
SPIP has been updated. Please review. On Thu, Sep 3, 2020 at 9:22 AM John Zhuge wrote: > Wenchen, sorry for the delay, I will post an update shortly. > > On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan