Does it make sense to keep a Hive installation when your parquet files
come with a transactional metadata layer like Delta Lake / Apache Iceberg?

My understanding from this:
https://github.com/delta-io/delta/issues/85

is that Hive is no longer necessary in a Spark cluster other than for
discovering where a table is stored. Hence, we can simply do something
like:
```
df = spark.read.format("delta").load(LOCATION)  # LOCATION = path to the Delta table
df.createOrReplaceTempView("myTable")
res = spark.sql("SELECT * FROM myTable")
```
and this approach still gets all the benefits of having the metadata
available for partition discovery / SQL optimization? My understanding is
that with Delta, the Hive metastore should only store a pointer from the
table name to the path of the table, and all other metadata comes from
the Delta log, which is processed by Spark.
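For instance, I imagine registering the table would look something like
the sketch below (the table name `events` and its path are hypothetical;
this is just my understanding of the setup, not a tested configuration):

```
# Register a Delta table in the metastore: only the name -> path mapping
# lives in Hive; schema and partition info come from the Delta log.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events
    USING DELTA
    LOCATION '/data/events'
""")

# Queries resolve the name through the metastore, then read metadata
# from the _delta_log directory at that location.
spark.sql("SELECT * FROM events").show()
```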

One reason I can think of for keeping Hive is to keep track of other data
sources that don't have a Delta / Iceberg transactional metadata layer (a
sketch of what I mean is below). But I'm not sure that alone makes it
worth it. Are there any use cases I might have missed for keeping a Hive
installation after migrating to Delta / Iceberg?
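To illustrate, here is a minimal sketch of the kind of non-Delta table I
mean (the name `legacy_events`, its schema, and its path are all made up):

```
# A plain Parquet table: schema and partitions are tracked only by the
# Hive metastore, since there is no _delta_log to describe them.
spark.sql("""
    CREATE TABLE IF NOT EXISTS legacy_events (id BIGINT, payload STRING, dt STRING)
    USING PARQUET
    PARTITIONED BY (dt)
    LOCATION '/data/legacy_events'
""")

# For an external partitioned table, existing partitions have to be
# discovered explicitly before they are visible to queries.
spark.sql("MSCK REPAIR TABLE legacy_events")

spark.sql("SELECT count(*) FROM legacy_events WHERE dt = '2021-04-25'").show()
```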

Please correct me if I've used any terms incorrectly.
