Re: Optimising multiple hive table join and query in spark

2020-03-16 Thread Manjunath Shetty H
Hi, if you only have 1.6, forget bucketing. https://databricks.com/session/bucketing-in-spark-sql-2-3 that only works well with Hive from 2.3 onwards. The other thing in your (daily?) batch job val myData = spark.read

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Manjunath Shetty H
do shuffle again. - Manjunath From: ayan guha <guha.a...@gmail.com> Sent: Sunday, March 15, 2020 5:46 PM To: Magnus Nilsson <ma...@kth.se> Cc: user <user@spark.apache.org> Subject: Re: Optimising multiple hive table join an

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Georg Heiler
join it won't do shuffle again. - Manjunath From: ayan guha Sent: Sunday, March 15, 2020 5:46 PM To: Magnus Nilsson Cc: user Subject: Re: Optimising multiple hive table join and query in spark Hi I would f

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Manjunath Shetty H
uring the join it won't do shuffle again. - Manjunath From: ayan guha Sent: Sunday, March 15, 2020 5:46 PM To: Magnus Nilsson Cc: user Subject: Re: Optimising multiple hive table join and query in spark Hi I would first and foremost try to identify where is the

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread ayan guha
Hi, I would first and foremost try to identify where most of the time is spent during the query. One possibility is that it just takes ramp-up time for executors to become available; if that's the case, then a dedicated YARN queue or the Spark Thrift Server may help. On Sun, Mar 15, 2020 at 11:

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Magnus Nilsson
Been a while, but I remember reading on Stack Overflow that you can use a UDF as a join condition to trick Catalyst into not reshuffling the partitions, i.e. use regular equality on the column you partitioned or bucketed by and your custom comparer for the other columns. Never got around to trying it out houg

Re: Optimising multiple hive table join and query in spark

2020-03-15 Thread Dennis Suhari
Hi, I am also using Spark on Hive Metastore. The performance is much better, especially for larger datasets. I have the feeling that performance is better if I load the data into DataFrames and do a join instead of doing a direct join within Spark SQL, but I can't explain why yet. Any body experie

Optimising multiple hive table join and query in spark

2020-03-14 Thread Manjunath Shetty H
Hi All, We have 10 tables in our data warehouse (HDFS/Hive) written in ORC format. We are serving a use case on top of that by joining 4-5 tables, using Hive as of now. But it is not as fast as we wanted it to be, so we are thinking of using Spark for this use case. Any suggestions on this? Is it go