Hi all,

While running performance tests for Spark, we found that enabling the table cache does not deliver the expected speedup compared to cloud storage + Parquet in many scenarios. We traced the cost to the fact that the current InMemoryRelation/InMemoryTableScanExec traverses the complete cached table even for highly selective queries. Compared with Parquet, which uses file footer statistics to skip unnecessary parts of the file, execution against the cached table is slower.
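To illustrate the kind of batch-level skipping we have in mind, here is a minimal sketch in plain Python (the class and function names are illustrative only, not Spark's actual API): each cached batch carries min/max statistics for a column, so a selective range predicate can rule out whole batches without touching their rows, analogous to Parquet footer pruning.

```python
# Illustrative sketch only -- not Spark's actual classes. Each cached batch
# keeps min/max statistics for one column; a selective predicate can then
# skip entire batches instead of scanning every row.

class CachedBatch:
    def __init__(self, rows):
        self.rows = rows          # column values held by this batch
        self.min = min(rows)      # per-batch min statistic
        self.max = max(rows)      # per-batch max statistic

def scan_with_pruning(batches, lo, hi):
    """Return values in [lo, hi], skipping batches whose stats rule them out."""
    out = []
    for b in batches:
        if b.max < lo or b.min > hi:  # predicate cannot match anything here
            continue                  # -> skip the whole batch
        out.extend(v for v in b.rows if lo <= v <= hi)
    return out

batches = [CachedBatch(list(range(s, s + 100))) for s in (0, 100, 200, 300)]
print(scan_with_pruning(batches, 150, 160))
```

With four batches of 100 values each and the predicate [150, 160], only one batch is actually scanned; the other three are skipped via their statistics, which is where the speedup for highly selective queries comes from.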
We have filed SPARK-22599 (https://issues.apache.org/jira/browse/SPARK-22599) and opened the corresponding PR at https://github.com/apache/spark/pull/19810. The design doc, also linked from the JIRA and the PR, is at https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing.

Our performance evaluation shows up to a 41% speedup over the current implementation: https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing

Please share your thoughts to help us improve the optimization of in-memory table scanning in Spark.

Best,
Nan