Hi all,

When we ran performance tests for Spark, we found that enabling the table
cache does not bring the expected speedup over cloud storage + Parquet in
many scenarios. We identified that the cost comes from the fact that the
current InMemoryRelation/InMemoryTableScanExec traverses the complete
cached table even for highly selective queries. Compared to Parquet, which
uses file footer statistics to skip unnecessary parts of the file,
execution against the cached table is slower.
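To make the contrast concrete, here is a minimal, Spark-free sketch of the
idea: keep per-batch min/max statistics for a cached column and skip whole
batches whose value range cannot satisfy a selective predicate, much like
Parquet does with footer statistics. The names here (CachedBatch,
select_with_stats) are illustrative assumptions, not Spark's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CachedBatch:
    rows: List[int]
    lo: int  # min value in the batch
    hi: int  # max value in the batch


def build_batches(rows: List[int], batch_size: int) -> List[CachedBatch]:
    """Split rows into fixed-size batches, recording min/max stats per batch."""
    chunks = (rows[i:i + batch_size] for i in range(0, len(rows), batch_size))
    return [CachedBatch(c, min(c), max(c)) for c in chunks]


def select_with_stats(batches: List[CachedBatch],
                      key: int) -> Tuple[List[int], int]:
    """Return rows equal to `key` plus the number of batches actually scanned."""
    hits, scanned = [], 0
    for b in batches:
        if key < b.lo or key > b.hi:
            continue  # entire batch skipped using its statistics
        scanned += 1
        hits.extend(r for r in b.rows if r == key)
    return hits, scanned
```

With clustered data, a point lookup touches only the one batch whose range
covers the key, while the current in-memory scan would read every batch.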

We have filed a JIRA at https://issues.apache.org/jira/browse/SPARK-22599
and opened the corresponding PR at https://github.com/apache/spark/pull/19810
(design doc:
https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing,
which is also linked from the JIRA/PR).

Our performance evaluation shows up to a 41% speedup compared to the
current implementation (
https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing
).

Please share your thoughts to help us improve the optimization of
in-memory table scanning in Spark.

Best,

Nan
