I upgraded my Spark to 2.4.3, which allows using the Delta Lake storage layer <https://docs.delta.io/0.3.0/delta-intro.html>. Incidentally, I wish Databricks had chosen a different name for it :)

Anyhow, although most of the storage examples use a normal file system path (/tmp/<TABLE>), I managed to put the data on HDFS itself. I assume this should work on any Hadoop Compatible File System (HCFS), such as GCS buckets?
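To make this concrete, here is a minimal sketch of the sort of thing I did from spark-shell (started with --packages io.delta:delta-core_2.11:0.3.0; the HDFS URI and the toy data are illustrative assumptions, not a real setup):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// A hypothetical HDFS path; any HCFS URI (gs://..., s3a://...) should in principle work the same way
val deltaPath = "hdfs://namenode:8020/user/mich/delta/sales"

// Toy data standing in for a real table
val sales = Seq(
  ("shop_a", "2019-07-01", 100.0),
  ("shop_a", "2019-07-02", 150.0),
  ("shop_b", "2019-07-01", 80.0),
  ("shop_b", "2019-07-02", 120.0)
).toDF("shop", "sale_date", "amount")

// Write it out as a Delta table; the transaction log lands under <path>/_delta_log
sales.write.format("delta").mode("overwrite").save(deltaPath)

// Read it back as an ordinary DataFrame ...
val df = spark.read.format("delta").load(deltaPath)

// ... and windowing analytics work as usual, e.g. a running total per shop
val byShop = Window.partitionBy("shop").orderBy("sale_date")
df.withColumn("running_total", sum("amount").over(byShop)).show()

Note that the read and the window function are exactly what one would run against a plain Parquet directory; only the format string changes.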
According to the link above:

"Delta Lake <https://delta.io/> is an open source storage layer <https://github.com/delta-io/delta> that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs."

So, in a nutshell, with ACID compliance we have got an Oracle-type DW on HDFS, with snapshots. So I am thinking aloud: besides its compatibility with Spark (which is great), where can I use this product to gain a strategic advantage?

Also, how much functional programming will this support? I gather that once you have created a DataFrame on top of the storage, windowing analytics etc. can be used BAU, as in the sketch above. I am sure someone can explain this.

Regards,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com