Re: Coalesce behaviour

2018-10-12 Thread Wenchen Fan
In your first example, the root RDD has 1000 partitions, then you do a shuffle (with repartitionAndSortWithinPartitions) and shuffle the data to 1000 reducers. Then you do coalesce, which asks Spark to launch only 20 reducers to process the data that was prepared for 1000 reducers. Since the reduc…
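
A minimal sketch of the mechanics described here (illustrative code, not from the thread; the partition counts follow the example):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("coalesce-demo").getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.parallelize(1 to 1000000, 1000).map(i => (i % 1000, i))

    // The shuffle targets 1000 reduce partitions, but coalesce(20) folds the
    // reduce side into 20 tasks: each task reads and sorts ~50 reduce
    // partitions' worth of data, so per-task memory pressure grows accordingly.
    rdd.repartitionAndSortWithinPartitions(new HashPartitioner(1000))
      .coalesce(20)
      .foreachPartition(_ => ())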

Re: Coalesce behaviour

2018-10-12 Thread Koert Kuipers
How can I get a shuffle with 2048 partitions and 2048 tasks, and then a map phase with 10 partitions and 10 tasks that writes to HDFS? Every time I try to do this using coalesce, the shuffle ends up having 10 tasks, which is unacceptable due to OOM. This makes coalesce somewhat useless. On Wed, Oct…
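
For contrast, a hypothetical sketch (not Koert's code) of one stock alternative: replacing coalesce with repartition keeps the 2048-task sort stage and still writes 10 files, but it pays for the extra shuffle he is trying to avoid and discards the within-partition ordering just produced.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // writeTenFiles is a made-up helper name; the path is a placeholder.
    def writeTenFiles(rdd: RDD[(String, String)], path: String): Unit =
      rdd
        .repartitionAndSortWithinPartitions(new HashPartitioner(2048)) // 2048 reduce tasks
        .repartition(10)      // second shuffle: 10 write tasks, ordering lost
        .saveAsTextFile(path)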

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-12 Thread Dongjoon Hyun
Hi, Holden. Since that's a performance regression at 2.4.0, I marked it as `Blocker` four days ago. Bests, Dongjoon. On Fri, Oct 12, 2018 at 11:45 AM Holden Karau wrote: > Following up, I just wanted to make sure this new blocker that Dongjoon > designated is surfaced - > https://jira.apache.org/jira/browse…

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-12 Thread Holden Karau
Following up, I just wanted to make sure this new blocker that Dongjoon designated is surfaced - https://jira.apache.org/jira/browse/SPARK-25579?filter=12340409&jql=affectedVersion%20%3D%202.4.0%20AND%20cf%5B12310320%5D%20is%20EMPTY%20AND%20project%20%3D%20spark%20AND%20(status%20%3D%20%22In%20Progr…

Re: Coalesce behaviour

2018-10-12 Thread Sergey Zhemzhitsky
... sorry for that, but there is a mistake in the second sample; here is the right one: // fails with either OOM or 'Container killed by YARN for exceeding memory limits ... spark.yarn.executor.memoryOverhead' rdd .map(item => item._1.toString -> item._2.toString) .repartitionAndSortWithinParti…
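
The flattened snippet presumably continues along these lines (a hypothetical reconstruction for readability; the partition counts and output path are illustrative, and the tail of the original message is lost):

    // fails with either OOM or 'Container killed by YARN for exceeding
    // memory limits ... spark.yarn.executor.memoryOverhead'
    rdd
      .map(item => item._1.toString -> item._2.toString)
      .repartitionAndSortWithinPartitions(new HashPartitioner(2048))
      .coalesce(10) // folds the 2048 sort tasks into 10, concentrating memory use
      .saveAsTextFile("hdfs:///tmp/out")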

Re: Coalesce behaviour

2018-10-12 Thread Sergey Zhemzhitsky
I'd like to reduce the number of files written to HDFS without introducing additional shuffles, while preserving the stability of the job, and I'd also like to understand why the samples below work in one case and fail in the other. Consider the following example, which does the sa…
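
A quick way to see the difference between the two kinds of samples is to compare lineages (a sketch; `rdd` stands in for whatever the truncated samples build, and the imports from the snippets above are assumed):

    val sorted = rdd.repartitionAndSortWithinPartitions(new HashPartitioner(2048))
    // coalesce pipelines onto the shuffle read: a single 10-task reduce stage
    println(sorted.coalesce(10).toDebugString)
    // repartition adds a shuffle: a 2048-task stage, then 10 write tasks
    println(sorted.repartition(10).toDebugString)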

Re: Timestamp Difference/operations

2018-10-12 Thread John Zhuge
Yeah, the "-" operator does not seem to be supported; however, you can use the "datediff" function: In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP), CAST('2000-01-01 00:00:00' AS TIMESTAMP)) Out[9]: +-…
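
A runnable rendering of that suggestion (assuming a SparkSession bound to `spark`); note that datediff works at day granularity, so the time-of-day parts are ignored:

    spark.sql("""
      SELECT datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP),
                      CAST('2000-01-01 00:00:00' AS TIMESTAMP)) AS diff_days
    """).show()
    // +---------+
    // |diff_days|
    // +---------+
    // |       31|
    // +---------+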

Code review and Coding livestreams today

2018-10-12 Thread Holden Karau
I’ll be doing my regular weekly code review at 10am Pacific today - https://youtu.be/IlH-EGiWXK8 - with a look at the current RC, and in the afternoon at 3pm Pacific I’ll be doing some live coding around a WIP graceful decommissioning PR - https://youtu.be/4FKuYk2sbQ8 -- Twitter: https://twitter.com/h…

Timestamp Difference/operations

2018-10-12 Thread Paras Agarwal
Hello Spark Community, Currently in Hive we can do operations on timestamps, like: CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS TIMESTAMP) It seems this is not supported in Spark. Is there any way to do it? Kindly provide some insight on this. Paras 9130006036
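
Since datediff (suggested in the reply above) only yields whole days, one common workaround for a second-level difference is subtracting epoch seconds (a sketch, again assuming a SparkSession bound to `spark`):

    spark.sql("""
      SELECT unix_timestamp(CAST('2000-01-01 12:34:34' AS TIMESTAMP))
           - unix_timestamp(CAST('2000-01-01 00:00:00' AS TIMESTAMP)) AS diff_seconds
    """).show()
    // diff_seconds = 45274, i.e. 12h 34m 34s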

Re: Remove Flume support in 3.0.0?

2018-10-12 Thread Hyukjin Kwon
Yea, I thought we were already going to remove this. +1 for removing it anyway. On Fri, Oct 12, 2018 at 1:44 AM, Wenchen Fan wrote: > Note that it was deprecated in 2.3.0 already: > https://spark.apache.org/docs/2.3.0/streaming-flume-integration.html > > On Fri, Oct 12, 2018 at 12:46 AM Reynold Xin…

SparkSQL read Hive transactional table

2018-10-12 Thread daily
Hi, I use the HCatalog Streaming Mutation API to write data to a Hive transactional table, and then I use SparkSQL to read data from that table. I get the right result. However, SparkSQL takes more time to read the Hive ORC bucketed transactional table, because SparkSQL reads all columns…
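
A sketch of the kind of read being compared (the table and column names are hypothetical); whether Spark actually prunes unselected columns for Hive ACID ORC tables is exactly the question raised here:

    // Selecting explicit columns normally lets the ORC reader prune the rest;
    // the report above suggests this is not happening for transactional tables.
    val df = spark.sql("SELECT user_id, event_time FROM db.acid_events")
    df.show()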