Does it make sense to keep a Hive installation when your Parquet files come
with a transactional metadata layer like Delta Lake / Apache Iceberg?

My understanding from this issue:
https://github.com/delta-io/delta/issues/85

is that Hive is no longer necessary, other than for discovering where a table
is stored. Hence, we can simply do something like:
```
# Read the Delta table directly by path (no metastore lookup);
# `location` is a placeholder for the table's storage path.
df = spark.read.format("delta").load(location)
df.createOrReplaceTempView("myTable")
res = spark.sql("select * from myTable")
```
and still get all the benefits of the metadata for partition discovery / SQL
optimization? With Delta, the Hive metastore should only store a pointer from
the table name to the path of the table; all other metadata comes from the
Delta log, which is processed by Spark.
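
As I understand it, registering a Delta table in the metastore then amounts to
recording just that pointer. A minimal sketch of what I mean, assuming the
Delta Lake SQL support is available in the Spark session and using a
placeholder path:
```
# Register an existing Delta table in the (Hive) metastore by pointing at its path.
# The metastore records only the name -> location mapping; schema, partitions, and
# file listings are resolved from the Delta transaction log at query time.
spark.sql("""
    CREATE TABLE IF NOT EXISTS myTable
    USING DELTA
    LOCATION '/path/to/delta-table'
""")

# Queries by name resolve through the metastore pointer, but planning still reads the Delta log.
res = spark.sql("SELECT * FROM myTable")
```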

One reason I can think of for keeping Hive is to track other data sources that
don't have a Delta / Iceberg transactional metadata layer. But I'm not sure
whether that alone makes it worth it; are there any use cases I might have
missed that would justify keeping a Hive installation after migrating to
Delta / Iceberg?

Please correct me if I've used any terms incorrectly.



