Hi everybody,

As some of you may know, at Talend we've been working for a while on adding the TPC-DS benchmark suite to Beam. We believe that having TPC-DS as part of the Beam testing workflow and release routine will help the community quickly detect performance regressions or improvements, identify missing or incorrect Beam SQL features, and run Beam SQL in different runtime environments with different runners.

What is TPC-DS? From the TPC-DS specification document [1]:

“TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general purpose decision support system.”

The TPC-DS benchmark suite for Beam is implemented as a separate testing tool for the Java SDK (like the well-known Nexmark benchmark suite) [2]. It currently supports a limited number of TPC-DS SQL queries (mostly because of limited SQL syntax support in Beam), accepts CSV and Parquet as input data formats, and runs on Jenkins with the three most popular Beam runners (Spark [3], Flink [4], Dataflow [5]). The job metrics are stored in InfluxDB and can be accessed through Grafana dashboards [6][7][8]. More details can be found in the Beam documentation [9].

There are certainly still plenty of things to do, such as adding new runners and supporting other SDKs and data formats, so your contributions are very welcome in any form. Still, we already have a first working, automated version that the community can use.

Also, I'd like to thank everybody who worked on this improvement!

-- Alexey

[1] https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp
[2] https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
[3] https://ci-beam.apache.org/job/beam_PostCommit_Java_Tpcds_Spark/
[4] https://ci-beam.apache.org/job/beam_PostCommit_Java_Tpcds_Flink/
[5] https://ci-beam.apache.org/job/beam_PostCommit_Java_Tpcds_Dataflow/
[6] http://metrics.beam.apache.org/d/tkqc0AdGk2/tpc-ds-spark-classic-new-sql?orgId=1
[7] http://metrics.beam.apache.org/d/8INnSY9Mv/tpc-ds-flink-sql?orgId=1
[8] http://metrics.beam.apache.org/d/tkqc0AdGk2/tpc-ds-spark-classic-new-sql?orgId=1
[9] https://beam.apache.org/documentation/sdks/java/testing/tpcds/
