Re: Handling user-facing metadata issues on file stream source & sink

2020-06-25 Thread Jungtaek Lim
Bump + adding one more issue I fixed (and by chance there's relevant report in user mailing list recently) * [SPARK-30462][SS] Streamline the logic on file stream source and sink to avoid memory issue [1] The patch stabilizes the driver's memory usage on utilizing a huge metadata log, which was t

Re: [DISCUSS][SPIP] Graceful Decommissioning

2020-06-25 Thread Holden Karau
Thanks for looping in more folks :) On Thu, Jun 25, 2020 at 7:41 PM Hyukjin Kwon wrote: > Thank you so much, Holden. > > PS: I cc'ed some people who might be interested in this too FYI. > > On Fri, Jun 26, 2020 at 11:26 AM, Holden Karau wrote: > >> At the recommendation of Hyukjin, I'm converting the g

Re: [DISCUSS][SPIP] Graceful Decommissioning

2020-06-25 Thread Hyukjin Kwon
Thank you so much, Holden. PS: I cc'ed some people who might be interested in this too FYI. On Fri, Jun 26, 2020 at 11:26 AM, Holden Karau wrote: > At the recommendation of Hyukjin, I'm converting the graceful > decommissioning work to an SPIP. The SPIP document is at > https://docs.google.com/document

[DISCUSS][SPIP] Graceful Decommissioning

2020-06-25 Thread Holden Karau
At the recommendation of Hyukjin, I'm converting the graceful decommissioning work to an SPIP. The SPIP document is at https://docs.google.com/document/d/1EOei24ZpVvR7_w0BwBjOnrWRy4k-qTdIlx60FsHZSHA/edit?usp=sharing and the associated JIRA is at https://issues.apache.org/jira/browse/SPARK-20624. Th

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-25 Thread Waleed Fateem
I was trying to make my email short and concise, but the rationale behind setting that as 1 by default is because it's safer. With algorithm version 2 you run the risk of having bad data in cases where tasks fail or even duplicate data if a task fails and succeeds on a reattempt (I don't know if th
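As context for the thread above: the committer behavior is controlled by a single Hadoop configuration key that Spark passes through to the underlying `FileOutputCommitter`. A minimal sketch of pinning it explicitly to version 1 in `spark-defaults.conf`, so a job does not depend on whatever default the Hadoop build ships with, might look like the following (this snippet is illustrative and not taken from the thread itself):

```properties
# Pin the FileOutputCommitter to algorithm v1. v1 moves task output into
# place during the job-commit phase, which avoids the partial/duplicate
# output risk that v2 has when tasks fail or are reattempted.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  1
```

The same key can also be passed per-job via `--conf` on `spark-submit`.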

Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-25 Thread Sean Owen
I think this is a Hadoop property that is just passed through? If the default is different in Hadoop 3 we could mention that in the docs. I don't know if we want to always set it to 1 as a Spark default, even in Hadoop 3, right? On Thu, Jun 25, 2020 at 2:43 PM Waleed Fateem wrote: > > Hello! > > I noti

Re: [Spark SQL] Question about support for TimeType columns in Apache Parquet files

2020-06-25 Thread Rylan Dmello
Hello Bart, Thank you for sharing these links, this was exactly what Tahsin and I were looking for. It looks like there has been a lot of discussion about this already, which is good to see. In one of these pull requests, there is a comment about the number of real-world use-cases for some kin

Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-25 Thread Waleed Fateem
Hello! I noticed that in the documentation starting with 2.2.0 it states that the parameter spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 by default: https://issues.apache.org/jira/browse/SPARK-20107 I don't actually see this being set anywhere explicitly in the Spark code and

Re: Use Hadoop-3.2 as a default Hadoop profile in 3.0.0?

2020-06-25 Thread Hyukjin Kwon
I don't have a strong opinion on changing the default either, but I would slightly prefer to keep the option to switch the Hadoop version first, just to stay safer. To be clear, we're now discussing the timing of when to set Hadoop 3.0.0 as the default, and which change has to come first, right?