Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-13 Thread Dongjoon Hyun
Yes. From my side, it's -1 for RC3.

Bests,
Dongjoon.

On Sat, Oct 13, 2018 at 1:24 PM Holden Karau wrote:
> So if it's a blocker would you think this should be a -1?
>
> On Fri, Oct 12, 2018 at 3:52 PM Dongjoon Hyun wrote:
>> Hi, Holden.
>>
>> Since that's a performance regression at 2.4.0, I marked a

Re: [VOTE] SPARK 2.4.0 (RC3)

2018-10-13 Thread Holden Karau
So if it's a blocker would you think this should be a -1?

On Fri, Oct 12, 2018 at 3:52 PM Dongjoon Hyun wrote:
> Hi, Holden.
>
> Since that's a performance regression at 2.4.0, I marked it as `Blocker` four days ago.
>
> Bests,
> Dongjoon.
>
> On Fri, Oct 12, 2018 at 11:45 AM Holden Karau wrote:
>> Fol

Re: Coalesce behaviour

2018-10-13 Thread Koert Kuipers
we have a collection of programs in the dataframe api that all do big shuffles, for which we use 2048+ partitions. this works fine, but it produces a lot of (small) output files, which puts pressure on the memory of the driver of any spark program that reads this data in again. so one of our de
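A minimal sketch of the trade-off this thread is about, assuming a SparkSession named `spark`, illustrative paths, and the 2048-partition shuffle mentioned above; none of the identifiers come from the original mail. Because coalesce avoids a shuffle, placing it after an aggregation also shrinks the aggregation's own parallelism, while repartition keeps the shuffle at full width at the cost of one extra exchange:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("coalesce-vs-repartition")
      .config("spark.sql.shuffle.partitions", "2048")  // big shuffle, as in the thread
      .getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/tmp/input")  // hypothetical input

    // coalesce(16) is a narrow dependency, so the post-shuffle stage
    // (the aggregation itself) runs with only 16 tasks, not 2048.
    df.groupBy($"key").count()
      .coalesce(16)
      .write.parquet("/tmp/out-coalesce")

    // repartition(16) adds a second shuffle: the aggregation keeps its
    // 2048 tasks, and only the final write produces 16 files.
    df.groupBy($"key").count()
      .repartition(16)
      .write.parquet("/tmp/out-repartition")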

Re: Coalesce behaviour

2018-10-13 Thread Sergey Zhemzhitsky
I've tried the same sample with the DataFrame API and it's much more stable, although it's backed by the RDD API. This sample works without any issues and without any additional Spark tuning:

    val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
    val df = rdd.map(item => item._1.toString ->
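The archive preview cuts the snippet off mid-expression. A self-contained reconstruction of what such a conversion could look like, assuming the pairs become a two-column DataFrame; the column names, the imports, and the final write step are assumptions, not from the original message:

    import org.apache.hadoop.io.Text
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("df-sample").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._

    val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
    // Hadoop input formats reuse Text instances, so copy each record to
    // String before handing it to Spark SQL.
    val df = rdd.map(item => item._1.toString -> item._2.toString)
                .toDF("key", "value")
    df.coalesce(16).write.parquet("/tmp/df-output")  // hypothetical output step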

SparkSQL read Hive transactional table

2018-10-13 Thread wys372b
Hi, I use the HCatalog Streaming Mutation API to write data to a Hive transactional table, and then I use SparkSQL to read data from the Hive transactional table. I get the right result. However, SparkSQL takes more time to read the Hive ORC bucketed transactional table, because SparkSQL reads all columns(
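A minimal sketch of the reading side, assuming Hive support is enabled and using a hypothetical table name `db.tx_table`; whether ORC column pruning actually kicks in for a transactional table depends on the Spark and Hive versions involved, so this illustrates the projection the mail is asking about rather than a confirmed fix:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("read-hive-tx-table")
      .enableHiveSupport()  // needed to resolve tables in the Hive metastore
      .getOrCreate()

    // Project only the columns you need; SELECT * forces a full-width scan.
    val df = spark.sql("SELECT id, value FROM db.tx_table WHERE id > 100")
    df.show()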