Shetty H
Subject: Re: Optimising multiple hive table join and query in spark
Hi,
if you only have Spark 1.6, forget bucketing.
See https://databricks.com/session/bucketing-in-spark-sql-2-3; that only works well
with Hive from 2.3 onwards.
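
For readers on Spark 2.x, here is a minimal sketch of what bucketing both sides of a join could look like with the DataFrame writer. The table names (`orders`, `customers`, and their `_bucketed` copies), the key `customer_id`, and the bucket count of 16 are assumptions for illustration only; none of them come from this thread. Whether the shuffle really disappears depends on your Spark version and table layout, so check the physical plan rather than trusting the sketch.

```scala
import org.apache.spark.sql.SparkSession

object BucketingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketing-sketch")
      .enableHiveSupport()
      // Disable broadcast joins so small test tables still show the sort-merge plan.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()

    // Hypothetical source tables; replace with your own.
    val orders = spark.table("orders")
    val customers = spark.table("customers")

    // Write both sides bucketed and sorted on the join key, with the same bucket count.
    orders.write
      .bucketBy(16, "customer_id")
      .sortBy("customer_id")
      .mode("overwrite")
      .saveAsTable("orders_bucketed")

    customers.write
      .bucketBy(16, "customer_id")
      .sortBy("customer_id")
      .mode("overwrite")
      .saveAsTable("customers_bucketed")

    val joined = spark.table("orders_bucketed")
      .join(spark.table("customers_bucketed"), "customer_id")

    // Inspect the physical plan and look for Exchange nodes feeding the join.
    joined.explain()

    spark.stop()
  }
}
```

If the plan still shows an Exchange on either side, the bucketing metadata is not being used for that join.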
The other thing in your (daily?) batch job
val myData =
spark.read
during the join it won't do shuffle again.
-
Manjunath
From: ayan guha <guha.a...@gmail.com>
Sent: Sunday, March 15, 2020 5:46 PM
To: Magnus Nilsson <ma...@kth.se>
Cc: user <user@spark.apache.org>
Subject: Re: Optimising multiple hive table join and query in spark
Hi
I would first and foremost try to identify where the most time is spent
during the query. One possibility is that it just takes ramp-up time for
executors to become available; if that's the case, then a dedicated YARN
queue or the Spark Thrift Server may help.
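
As a rough sketch of the suggestion above, the snippet below submits to a dedicated YARN queue and times one query from the driver. The queue name `reporting` and the table `db.fact_table` are hypothetical. `spark.yarn.queue` is the standard Spark-on-YARN setting, and `spark.time` prints the elapsed time of the enclosed block. It measures the whole action, not individual stages; for a per-stage breakdown the Spark UI is the better tool.

```scala
import org.apache.spark.sql.SparkSession

object QueryTimingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("query-timing-sketch")
      .enableHiveSupport()
      // Hypothetical dedicated queue; it must already exist in your YARN scheduler config.
      .config("spark.yarn.queue", "reporting")
      .getOrCreate()

    // Run the same action twice. If the first run is much slower, part of the
    // cost may be executor ramp-up and cold caches rather than the query itself.
    spark.time {
      spark.table("db.fact_table").count() // hypothetical table
    }
    spark.time {
      spark.table("db.fact_table").count()
    }

    spark.stop()
  }
}
```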
On Sun, Mar 15, 2020 at 11:
It's been a while, but I remember reading on Stack Overflow that you can use a
UDF as a join condition to trick Catalyst into not reshuffling the partitions,
i.e. use regular equality on the column you partitioned or bucketed by and your
custom comparer for the other columns. Never got around to trying it out
though.
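
A sketch of what that idea might look like, untested in the same way the message says it is. The tables `left_bucketed` and `right_bucketed`, the bucketed key `id`, the string column `name`, and the comparison logic are all assumptions made up for illustration. Whether Catalyst actually skips the reshuffle has to be confirmed by reading the plan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-join-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical tables, both assumed to be bucketed or partitioned by "id".
    val left = spark.table("left_bucketed")
    val right = spark.table("right_bucketed")

    // Custom comparer for a column that is NOT part of the bucketing key.
    val sameName = udf { (a: String, b: String) =>
      a != null && b != null && a.trim.equalsIgnoreCase(b.trim)
    }

    // Plain equality on the bucketed column, UDF for the remaining condition.
    val joined = left.join(
      right,
      left("id") === right("id") && sameName(left("name"), right("name"))
    )

    // Check whether an Exchange still appears before the join.
    joined.explain()

    spark.stop()
  }
}
```

Note that the UDF is opaque to the optimizer, so it is evaluated as an extra filter on the rows matched by the equality condition.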
Hi,
I am also using Spark on Hive Metastore. The performance is much better,
esp. for larger datasets. I have the feeling that the performance is better if
I load the data into DataFrames and do a join instead of doing a direct join
within Spark SQL, but I can't explain why yet.
Any body experie
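
One way to investigate the difference described above is to build the same join both ways and compare the plans. This is only a sketch: `db.table_a`, `db.table_b`, and the join column `id` are hypothetical names. Both forms go through the same optimizer, so identical physical plans would suggest the difference lies elsewhere (for example in caching or in how the data was loaded).

```scala
import org.apache.spark.sql.SparkSession

object JoinPlanComparisonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-plan-comparison-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical Hive tables sharing an "id" column.
    val a = spark.table("db.table_a")
    val b = spark.table("db.table_b")

    // Join expressed through the DataFrame API.
    val viaApi = a.join(b, Seq("id"))

    // The same join expressed as a SQL string.
    val viaSql = spark.sql(
      "SELECT * FROM db.table_a a JOIN db.table_b b ON a.id = b.id"
    )

    // Print the parsed, analyzed, optimized, and physical plans for both.
    viaApi.explain(true)
    viaSql.explain(true)

    spark.stop()
  }
}
```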
Hi All,
We have 10 tables in a data warehouse (HDFS/Hive) written in ORC format. We
are serving a use case on top of that by joining 4-5 tables using Hive as of
now. But it is not as fast as we want it to be, so we are thinking of using
Spark for this use case.
Any suggestion on this ? Is it go