Hi everyone,
We'd like to discuss our proposal for a Spark relational cache in this thread. Spark has a native command for RDD caching, but the CACHE command in Spark SQL is limited: the cache cannot be used across sessions, and users have to rewrite their queries by hand to take advantage of an existing cache. To address this, we have done some initial work on the following:

1. Allow users to persist a cache on HDFS in Parquet format.
2. Rewrite user queries in Catalyst so that they utilize any existing cache (persisted on HDFS, or defined in memory in the current session) whenever possible.

I have created a JIRA ticket (https://issues.apache.org/jira/browse/SPARK-26764) and attached an official SPIP document to it. A rough sketch of the two ideas above is appended after my signature.

Thanks for taking a look at the proposal.

Best Regards,
Daoyuan
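P.S. For illustration only, here is a minimal sketch of what the two pieces could look like from a user's and an extension author's point of view. This is not the SPIP implementation: the HDFS path, table names, and the rule name are hypothetical, and only public Spark APIs (the DataFrame reader/writer and spark.experimental.extraOptimizations) are used.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    object RelationalCacheSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("relational-cache-sketch")
          .getOrCreate()

        // (1) Persist the result of an expensive query to HDFS as Parquet,
        // so it can outlive the current session. Path and table are
        // hypothetical.
        spark.sql("SELECT dept, count(*) AS cnt FROM employees GROUP BY dept")
          .write.mode("overwrite")
          .parquet("hdfs:///cache/employees_by_dept")

        // In a later session, the persisted cache can be read back directly
        // and exposed as a view, with no recomputation needed.
        spark.read.parquet("hdfs:///cache/employees_by_dept")
          .createOrReplaceTempView("employees_by_dept_cache")

        // (2) Hook a cache-aware rewrite into Catalyst. A real
        // implementation would compare (sub)plans against the logical plans
        // of known caches and substitute a Parquet scan on a match; the
        // no-op body below only shows the extension point.
        case class UseRelationalCache(session: SparkSession)
            extends Rule[LogicalPlan] {
          override def apply(plan: LogicalPlan): LogicalPlan = plan
        }
        spark.experimental.extraOptimizations ++= Seq(UseRelationalCache(spark))
      }
    }

The SPIP itself describes the actual design; the sketch only shows that persistence is plain Parquet I/O and that the rewrite can be plugged in as an optimizer rule.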