Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread ayan guha
Hi, we used spark.sql to create a table using DELTA. We also have a Hive metastore attached to the Spark session, so the table also gets created in the Hive metastore. We then tried to query the table from Hive and faced the following issues: 1. SERDE is SequenceFile, should have been Parquet 2. Schema f

Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread Gourav Sengupta
Hi Liwen, thanks a ton, I think that there is a difference between a storage class and metastore, just like there is a difference between a database and file system and coffee and cup. It will be wonderful to keep the focus on the fantastic opportunity that Delta creates for us :) Regards, Gour

Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread Liwen Sun
Hi James, Right now we don't have plans for having a catalog component as part of Delta Lake, but we are looking to support Hive metastore and also DDL commands in the near future. Thanks, Liwen On Thu, Jun 20, 2019 at 4:46 AM James Cotrotsios wrote: > Is there a plan to have a business catalo

Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread Li Gao
Lyft recently open sourced a data discovery tool called Amundsen that can serve many of the data catalog needs. https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9 https://github.com/lyft/amundsenmetadatalibrary You still need HMS to store the data schema though. On

Spark-cluster slowness

2019-06-20 Thread Amit Sharma
I have a Spark cluster in each of two data centers. Cluster B is 6 times slower than cluster A: I ran the same job on both clusters and the run time differs by a factor of 6. I used the same config on both and am using Spark 2.3.3. The Spark UI displays the slave nodes, but when I check u

Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread James Cotrotsios
Is there a plan to have a business catalog component for the Data Lake? If not, how would someone make a proposal to create an open source project related to that? I would be interested in building out an open source data catalog that would use the Hive metadata store as a baseline for technical met