This is an automated email from the ASF dual-hosted git repository. morningman pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push: new f0f3548dfe [regression](multi-catalog)add EMR cloud env test tools (#21788) f0f3548dfe is described below commit f0f3548dfee84160d224b0bfd57e5098efa258f3 Author: slothever <18522955+w...@users.noreply.github.com> AuthorDate: Fri Jul 28 09:45:10 2023 +0800 [regression](multi-catalog)add EMR cloud env test tools (#21788) add emr test tools for aliyun, huawei cloud, tencent cloud. --- tools/emr_storage_regression/README.md | 178 +++++++++++++++++++ tools/emr_storage_regression/default_emr_env.sh | 58 +++++++ tools/emr_storage_regression/emr_tools.sh | 192 +++++++++++++++++++++ .../ping_test/create_catalog_aliyun.sql | 12 ++ .../ping_test/create_catalog_aws.sql | 7 + .../ping_test/create_catalog_hw.sql | 5 + .../ping_test/create_catalog_tx.sql | 5 + .../ping_test/data/create_dlf_ping.sql | 9 + .../ping_test/data/create_hive_ping.sql | 9 + .../ping_test/data/create_spark_ping.sql | 9 + .../ping_test/data/data_for_dlf.sql | 7 + .../ping_test/data/data_for_hive.sql | 15 ++ .../ping_test/data/data_for_spark.sql | 10 ++ tools/emr_storage_regression/ping_test/ping.sql | 15 ++ .../ping_test/ping_aliyun.sql | 23 +++ tools/emr_storage_regression/ping_test/ping_poc.sh | 151 ++++++++++++++++ .../standard_set/analyze.sql | 69 ++++++++ .../standard_set/gen_spark_create_sql.sh | 38 ++++ .../gen_tbl/gen_clickbench_create_sql.sh | 153 ++++++++++++++++ .../standard_set/gen_tbl/gen_ssb_create_sql.sh | 166 ++++++++++++++++++ .../standard_set/gen_tbl/gen_tpch_create_sql.sh | 140 +++++++++++++++ .../standard_set/queries/clickbench_queries.sql | 43 +++++ .../standard_set/queries/ssb_flat_queries.sql | 13 ++ .../standard_set/queries/ssb_queries.sql | 13 ++ .../standard_set/queries/tpch_queries.sql | 22 +++ .../standard_set/run_queries.sh | 46 +++++ .../standard_set/run_standard_set.sh | 76 ++++++++ 27 files changed, 1484 insertions(+) diff --git a/tools/emr_storage_regression/README.md b/tools/emr_storage_regression/README.md new file mode 100644 index 0000000000..bdc6f4f8ba --- /dev/null +++ b/tools/emr_storage_regression/README.md @@ -0,0 +1,178 @@ +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
+--> + +# Data Lake Regression Testing Tool For External Table + +This tool is used to test Doris external tables on the object storage services of various cloud vendors. + +> Supported storage formats: HDFS, Alibaba Cloud OSS, Tencent Cloud COS, Huawei Cloud OBS + +> Supported data lake table formats: Iceberg + +The following shows examples of the command line options: + +``` +sh tools/emr_storage_regression/emr_tools.sh --profile default_emr_env.sh +``` + +Or + +``` +sh tools/emr_storage_regression/emr_tools.sh --case CASE --endpoint ENDPOINT --region REGION --service SERVICE --ak AK --sk SK --host HOST --user USER --port PORT +``` + +The usage of each option is described below. + +## Connectivity Test + +When the `--case` option is set to `ping`, the script checks Doris connectivity on EMR: + +- `--endpoint`, Object Storage Endpoint. + +- `--region`, Object Storage Region. + +- `--ak`, Object Storage Access Key. + +- `--sk`, Object Storage Secret Key. + +- `--host`, Doris MySQL Client IP. + +- `--user`, Doris MySQL Client Username. + +- `--port`, Doris MySQL Client Port. + +- `--service`, EMR cloud vendor: ali (Alibaba), hw (Huawei), tx (Tencent). + +### Environment Variables + +Modify the environment variables in `default_emr_env.sh`; the script will execute `source default_emr_env.sh` to make them take effect. + +If the environment variables are configured, you can run the test script directly with the following command: + +``` +sh emr_tools.sh --profile default_emr_env.sh +``` + +### The Script Execution Steps For Connectivity Test + +1. Create Spark and Hive tables on EMR +2. Use the Spark and Hive command lines to insert sample data +3. Create the Doris catalogs for the connectivity test +4. Execute the connectivity test SQL: `ping.sql` + +### Alibaba Cloud + +``` +sh emr_tools.sh --profile default_emr_env.sh +``` + +Or + +Set `--service` to `ali`, and then test connectivity on Alibaba Cloud. + +``` +sh emr_tools.sh --case ping --endpoint oss-cn-beijing-internal.aliyuncs.com --region cn-beijing --service ali --ak ak --sk sk --host 127.0.0.1 --user root --port 9030 > log +``` + +Alibaba Cloud EMR also supports testing connectivity for both Doris with DLF metadata and Doris on OSS-HDFS storage. + +- The DLF metadata connectivity test needs to be performed on an EMR cluster where DLF serves as the metadata store. The default value of `DLF_ENDPOINT` is `datalake-vpc.cn-beijing.aliyuncs.com`, configured in `ping_test/ping_poc.sh`. + +- To test OSS-HDFS storage connectivity, you need to [enable and configure the HDFS service on the OSS storage](https://www.alibabacloud.com/help/en/e-mapreduce/latest/oss-hdfsnew). The default value of `JINDO_ENDPOINT` is `cn-beijing.oss-dls.aliyuncs.com`, configured in `ping_test/ping_poc.sh`. + +### Tencent Cloud + +``` +sh emr_tools.sh --profile default_emr_env.sh +``` + +Or + +Set `--service` to `tx`, and then test connectivity on Tencent Cloud. + +``` +sh emr_tools.sh --case ping --endpoint cos.ap-beijing.myqcloud.com --region ap-beijing --service tx --ak ak --sk sk --host 127.0.0.1 --user root --port 9030 > log +``` + +### Huawei Cloud + +``` +sh emr_tools.sh --profile default_emr_env.sh +``` + +Or + +Set `--service` to `hw`, and then test connectivity on Huawei Cloud. + +``` +sh emr_tools.sh --case ping --endpoint obs.cn-north-4.myhuaweicloud.com --region cn-north-4 --service hw --ak ak --sk sk --host 127.0.0.1 --user root --port 9030 > log +```
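+ +As a quick manual check (a sketch, not part of the committed tools), you can query one of the catalogs created by the ping test from the MySQL client, assuming the default host, port, and user shown above and the `hms_hdfs` catalog: + +``` +mysql -h 127.0.0.1 -P 9030 -uroot -e "switch hms_hdfs; select * from hive_hms_db.types limit 1;" +```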
+ +## Performance Testing on Standard Test Set + +When the `--case` option is set to `data_set`, the script tests the query performance of Doris external tables: + +- `--test`, test data set: ssb, ssb_flat, tpch, clickbench, or all. Default: `all`. + +- `--service`, EMR cloud vendor: ali (Alibaba), hw (Huawei), tx (Tencent). + +- `--host`, Doris MySQL Client IP. + +- `--user`, Doris MySQL Client Username. + +- `--port`, Doris MySQL Client Port. + +### Environment Variables + +Modify the environment variables described above in `default_emr_env.sh`; the script will execute `source default_emr_env.sh` to make them take effect. + +If the environment variables are configured, you can run the test script directly with the following command: + +``` +sh emr_tools.sh --profile default_emr_env.sh +``` + +### Prepare Data + +1. To run the standard test set with the `emr_tools.sh` script, you need to set the object storage bucket in the `BUCKET` variable and put the prepared data under that bucket in advance (see the layout sketch at the end of this section). The script will generate table creation statements based on the bucket. + +2. Currently, the `emr_tools.sh` script supports Iceberg, Parquet, and ORC data for ssb, ssb_flat, tpch, and clickbench. + +### Execution Steps + +1. After the connectivity test, the Doris catalogs corresponding to the standard test set are created +2. Prepare the test set data in the object storage bucket specified by the `BUCKET` variable +3. Generate Spark table creation statements and create the Spark object storage tables on EMR +4. Create the Spark tables in the local HDFS directory: `hdfs:///benchmark-hdfs` +5. Optionally, analyze the Doris tables ahead of time by manually executing the statements in `analyze.sql` in the Doris catalog +6. Execute the standard test set script: `run_standard_set.sh` + +### Standard data set: ssb, ssb_flat, tpch, clickbench + +- Full test. After executing the test command, Doris will run the ssb, ssb_flat, tpch, and clickbench tests in sequence, and the test results will cover the cases on HDFS and on the object storage specified by `--service`. + +``` +sh emr_tools.sh --case data_set --service ali --host 127.0.0.1 --user root --port 9030 > log +``` + +- Specify a single test. The `--test` option can be set to one of ssb, ssb_flat, tpch, and clickbench. + +``` +sh emr_tools.sh --case data_set --test ssb --service ali --host 127.0.0.1 --user root --port 9030 > log +```
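+ +Based on the paths used in `standard_set/gen_spark_create_sql.sh`, the prepared data is expected to be laid out under the bucket as follows (a sketch; the table sub-directories match the table names created by the generator scripts): + +``` +${BUCKET}/ssb/ssb100_orc/ # customer, dates, lineorder, lineorder_flat, part, supplier +${BUCKET}/ssb/ssb100_parquet/ # same table sub-directories as ssb100_orc +${BUCKET}/tpch/tpch100_orc/ # customer, lineitem, nation, orders, part, partsupp, region, supplier +${BUCKET}/tpch/tpch100_parquet/ # same table sub-directories as tpch100_orc +${BUCKET}/clickbench/hits_orc/ +${BUCKET}/clickbench/hits_parquet/ +```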
diff --git a/tools/emr_storage_regression/default_emr_env.sh b/tools/emr_storage_regression/default_emr_env.sh new file mode 100644 index 0000000000..0fae74e19b --- /dev/null +++ b/tools/emr_storage_regression/default_emr_env.sh @@ -0,0 +1,58 @@ +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +############################################################## +# See emr_tools.sh +############################################################## + +# specified services: ali,hw,tx +export SERVICE=ali +# doris host +export HOST=127.0.0.1 +# doris user +export USER=root +# doris mysql cli port +export PORT=9030 + +# prepare endpoint,region,ak/sk +if [[ ${SERVICE} == 'ali' ]]; then + export CASE=ping + export AK=ak + export SK=sk + export ENDPOINT=oss-cn-beijing-internal.aliyuncs.com + export REGION=cn-beijing + export HMS_META_URI="thrift://172.16.1.1:9083" + export HMS_WAREHOUSE=oss://benchmark-oss/user +elif [[ ${SERVICE} == 'hw' ]]; then + export CASE=ping + export AK=ak + export SK=sk + export ENDPOINT=obs.cn-north-4.myhuaweicloud.com + export REGION=cn-north-4 + export HMS_META_URI="thrift://node1:9083,thrift://node2:9083" + export HMS_WAREHOUSE=obs://datalake-bench/user + export BEELINE_URI="jdbc:hive2://192.168.0.1:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;hive.server2.proxy.user=hive" +elif [[ ${SERVICE} == 'tx' ]]; then + export CASE=ping + export AK=ak + export SK=sk + export ENDPOINT=cos.ap-beijing.myqcloud.com + export REGION=ap-beijing + export HMS_META_URI="thrift://172.21.0.1:7004" + export HMS_WAREHOUSE=cosn://datalake-bench-cos-1308700295/user +fi diff --git a/tools/emr_storage_regression/emr_tools.sh b/tools/emr_storage_regression/emr_tools.sh new file mode 100644 index 0000000000..d1a7c0fe01 --- /dev/null +++ b/tools/emr_storage_regression/emr_tools.sh @@ -0,0 +1,192 @@ +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +############################################################## +# This script is used to test EMR cloud service +# Usage: +# provide your env arguments in default_emr_env.sh +# sh emr_tools.sh --case ping --endpoint oss-cn-beijing-internal.aliyuncs.com --region cn-beijing --service ali --ak ak --sk sk +############################################################## + +set -eo pipefail + +usage() { + echo " +Usage: $0 <options> + Optional options: + [no option] + --case regression case runner: ping, data_set + --profile cloud credential profile + --ak cloud access key + --sk cloud secret key + --endpoint cloud endpoint + --region cloud region + --service cloud service provider: ali, tx, hw + --host doris mysql cli host, example: 127.0.0.1 + --user doris username, example: user + --port doris port, example: 9030 + Example: + sh emr_tools.sh --case ping --endpoint oss-cn-beijing-internal.aliyuncs.com --region cn-beijing --service ali --ak ak --sk sk + " + exit 1 +} + +if ! 
OPTS="$(getopt \ + -n "$0" \ + -o '' \ + -l 'case:' \ + -l 'profile:' \ + -l 'ak:' \ + -l 'sk:' \ + -l 'endpoint:' \ + -l 'region:' \ + -l 'service:' \ + -l 'host:' \ + -l 'user:' \ + -l 'port:' \ + -l 'test:' \ + -o 'h' \ + -- "$@")"; then + usage +fi +eval set -- "${OPTS}" + +while true; do + case "$1" in + --profile) + PROFILE="$2" + # can use custom profile: sh emr_tools.sh --profile default_emr_env.sh + if [[ -n "${PROFILE}" ]]; then + # example: "$(pwd)/default_emr_env.sh" + # shellcheck disable=SC1090 + source "${PROFILE}" + fi + shift 2 + break + ;; + --case) + CASE="$2" + shift 2 + ;; + --ak) + AK="$2" + shift 2 + ;; + --sk) + SK="$2" + shift 2 + ;; + --endpoint) + ENDPOINT="$2" + shift 2 + ;; + --region) + REGION="$2" + shift 2 + ;; + --test) + TEST_SET="$2" + shift 2 + ;; + --service) + SERVICE="$2" + shift 2 + ;; + --host) + HOST="$2" + shift 2 + ;; + --user) + USER="$2" + shift 2 + ;; + --port) + PORT="$2" + shift 2 + ;; + -h) + usage + ;; + --) + shift + break + ;; + *) + echo "$1" + echo "Internal error" + exit 1 + ;; + esac +done + +export FE_HOST=${HOST} +export USER=${USER} +export FE_QUERY_PORT=${PORT} + +if [[ ${CASE} == 'ping' ]]; then + if [[ ${SERVICE} == 'hw' ]]; then + # shellcheck disable=SC2269 + HMS_META_URI="${HMS_META_URI}" + # shellcheck disable=SC2269 + HMS_WAREHOUSE="${HMS_WAREHOUSE}" + # shellcheck disable=SC2269 + BEELINE_URI="${BEELINE_URI}" + elif [[ ${SERVICE} == 'ali' ]]; then + # shellcheck disable=SC2269 + HMS_META_URI="${HMS_META_URI}" + # shellcheck disable=SC2269 + HMS_WAREHOUSE="${HMS_WAREHOUSE}" + else + # [[ ${SERVICE} == 'tx' ]]; + # shellcheck disable=SC2269 + HMS_META_URI="${HMS_META_URI}" + # shellcheck disable=SC2269 + HMS_WAREHOUSE="${HMS_WAREHOUSE}" + fi + sh ping_test/ping_poc.sh "${ENDPOINT}" "${REGION}" "${SERVICE}" "${AK}" "${SK}" "${HMS_META_URI}" "${HMS_WAREHOUSE}" "${BEELINE_URI}" +elif [[ ${CASE} == 'data_set' ]]; then + if [[ ${SERVICE} == 'tx' ]]; then + BUCKET=cosn://datalake-bench-cos-1308700295 + elif [[ ${SERVICE} == 'ali' ]]; then + BUCKET=oss://benchmark-oss + fi + # gen table for spark + if ! sh standard_set/gen_spark_create_sql.sh "${BUCKET}" obj; then + echo "Failed to generate Spark obj tables for the test set" + exit 1 + fi + if ! sh standard_set/gen_spark_create_sql.sh hdfs:///benchmark-hdfs hdfs; then + echo "Failed to generate Spark hdfs tables for the test set; import the hdfs data first" + exit 1 + fi + # FE_HOST=172.16.1.163 + # USER=root + # PORT=9035 + if [[ -z ${TEST_SET} ]]; then + TEST_SET='all' + fi + TYPE=hdfs sh standard_set/run_standard_set.sh "${FE_HOST}" "${USER}" "${PORT}" hms_hdfs "${TEST_SET}" + TYPE=hdfs sh standard_set/run_standard_set.sh "${FE_HOST}" "${USER}" "${PORT}" iceberg_hms "${TEST_SET}" + if [[ ${SERVICE} == 'tx' ]]; then + sh standard_set/run_standard_set.sh "${FE_HOST}" "${USER}" "${PORT}" hms_cos "${TEST_SET}" + sh standard_set/run_standard_set.sh "${FE_HOST}" "${USER}" "${PORT}" iceberg_hms_cos "${TEST_SET}" + elif [[ ${SERVICE} == 'ali' ]]; then + sh standard_set/run_standard_set.sh "${FE_HOST}" "${USER}" "${PORT}" hms_oss "${TEST_SET}" + sh standard_set/run_standard_set.sh "${FE_HOST}" "${USER}" "${PORT}" iceberg_hms_oss "${TEST_SET}" + fi +fi
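+ +For reference, each benchmark pass above is a direct call into `standard_set/run_standard_set.sh`; the following sketch, using the example host, user, and port from the README, runs only the ssb queries against the `hms_hdfs` catalog on HDFS data: + +``` +TYPE=hdfs sh standard_set/run_standard_set.sh 127.0.0.1 root 9030 hms_hdfs ssb +```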
diff --git a/tools/emr_storage_regression/ping_test/create_catalog_aliyun.sql b/tools/emr_storage_regression/ping_test/create_catalog_aliyun.sql new file mode 100644 index 0000000000..e19a9672e8 --- /dev/null +++ b/tools/emr_storage_regression/ping_test/create_catalog_aliyun.sql @@ -0,0 +1,12 @@ +CREATE CATALOG IF NOT EXISTS hms_hdfs PROPERTIES ( "type" = "hms", "hive.metastore.uris" = "META_URI" ); +CREATE CATALOG IF NOT EXISTS hms_oss PROPERTIES ( "type" = "hms", "hive.metastore.uris" = "META_URI", "oss.secret_key" = "SK_INPUT", "oss.endpoint" = "ENDPOINT", "oss.access_key" = "AK_INPUT" ); +CREATE CATALOG IF NOT EXISTS hms_jindo PROPERTIES ( "type" = "hms", "hive.metastore.uris" = "META_URI", "oss.secret_key" = "SK_INPUT", "oss.endpoint" = "JINDO_ENDPOINT", "oss.access_key" = "AK_INPUT", "oss.hdfs.enabled" = "true" ); + +CREATE CATALOG IF NOT EXISTS iceberg_hms PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.uris" = "META_URI" ); +CREATE CATALOG IF NOT EXISTS iceberg_hms_oss PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.uris" = "META_URI", "oss.secret_key" = "SK_INPUT", "oss.endpoint" = "ENDPOINT", "oss.access_key" = "AK_INPUT" ); +CREATE CATALOG IF NOT EXISTS iceberg_hms_jindo PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.uris" = "META_URI", "oss.secret_key" = "SK_INPUT", "oss.endpoint" = "JINDO_ENDPOINT", "oss.access_key" = "AK_INPUT", "oss.hdfs.enabled" = "true" ); + +CREATE CATALOG IF NOT EXISTS dlf PROPERTIES( "type" = "hms", "hive.metastore.type" = "dlf", "dlf.proxy.mode" = "DLF_ONLY", "dlf.endpoint" = "DLF_ENDPOINT", "dlf.uid" = "217316283625971977", "dlf.access_key" = "AK_INPUT", "dlf.secret_key" = "SK_INPUT" ); +CREATE CATALOG IF NOT EXISTS dlf_jindo PROPERTIES( "type" = "hms", "hive.metastore.type" = "dlf", "dlf.proxy.mode" = "DLF_ONLY", "dlf.endpoint" = "DLF_ENDPOINT", "dlf.uid" = "217316283625971977", "dlf.access_key" = "AK_INPUT", "dlf.secret_key" = "SK_INPUT", "oss.hdfs.enabled" = "true" ); +CREATE CATALOG IF NOT EXISTS iceberg_dlf PROPERTIES ( "type"="iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.type" = "dlf", "dlf.endpoint" = "DLF_ENDPOINT", "dlf.region" = "cn-beijing", "dlf.proxy.mode" = "DLF_ONLY", "dlf.uid" = "217316283625971977", "dlf.access_key" = "AK_INPUT", "dlf.secret_key" = "SK_INPUT" ); +CREATE CATALOG IF NOT EXISTS iceberg_dlf_jindo PROPERTIES ( "type"="iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.type" = "dlf", "dlf.endpoint" = "DLF_ENDPOINT", "dlf.region" = "cn-beijing", "dlf.proxy.mode" = 
"DLF_ONLY", "dlf.uid" = "217316283625971977", "dlf.access_key" = "AK_INPUT", "dlf.secret_key" = "SK_INPUT", "oss.hdfs.enabled" = "true" ); \ No newline at end of file diff --git a/tools/emr_storage_regression/ping_test/create_catalog_aws.sql b/tools/emr_storage_regression/ping_test/create_catalog_aws.sql new file mode 100644 index 0000000000..28ef2aaf4b --- /dev/null +++ b/tools/emr_storage_regression/ping_test/create_catalog_aws.sql @@ -0,0 +1,7 @@ +CREATE CATALOG IF NOT EXISTS hms_hdfs PROPERTIES ( "type" = "hms", "hive.metastore.uris" = "META_URI" ); +CREATE CATALOG IF NOT EXISTS hms_s3 PROPERTIES ( "type" = "hms", "hive.metastore.uris" = "META_URI", "s3.secret_key" = "SK_INPUT", "s3.endpoint" = "ENDPOINT", "s3.access_key" = "AK_INPUT" ); + +CREATE CATALOG IF NOT EXISTS iceberg_hms PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.uris" = "META_URI" ); +CREATE CATALOG IF NOT EXISTS iceberg_hms_s3 PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.uris" = "META_URI", "s3.secret_key" = "SK_INPUT", "s3.endpoint" = "ENDPOINT", "s3.access_key" = "AK_INPUT" ); + +-- glue s3 \ No newline at end of file diff --git a/tools/emr_storage_regression/ping_test/create_catalog_hw.sql b/tools/emr_storage_regression/ping_test/create_catalog_hw.sql new file mode 100644 index 0000000000..84b9c4b777 --- /dev/null +++ b/tools/emr_storage_regression/ping_test/create_catalog_hw.sql @@ -0,0 +1,5 @@ +CREATE CATALOG IF NOT EXISTS hms_hdfs PROPERTIES ( "type" = "hms", "hive.metastore.uris" = "META_URI" ); +CREATE CATALOG IF NOT EXISTS hms_obs PROPERTIES ( "type" = "hms", "hive.metastore.uris" = "META_URI", "obs.secret_key" = "SK_INPUT", "obs.endpoint" = "ENDPOINT", "obs.access_key" = "AK_INPUT" ); + +CREATE CATALOG IF NOT EXISTS iceberg_hms PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.uris" = "META_URI" ); +CREATE CATALOG IF NOT EXISTS iceberg_hms_obs PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.uris" = "META_URI", "obs.secret_key" = "SK_INPUT", "obs.endpoint" = "ENDPOINT", "obs.access_key" = "AK_INPUT" ); diff --git a/tools/emr_storage_regression/ping_test/create_catalog_tx.sql b/tools/emr_storage_regression/ping_test/create_catalog_tx.sql new file mode 100644 index 0000000000..2f672736db --- /dev/null +++ b/tools/emr_storage_regression/ping_test/create_catalog_tx.sql @@ -0,0 +1,5 @@ +CREATE CATALOG IF NOT EXISTS hms_hdfs PROPERTIES ( "type" = "hms", "hive.metastore.uris" = "META_URI" ); +CREATE CATALOG IF NOT EXISTS hms_cos PROPERTIES ( "type" = "hms", "hive.metastore.uris" = "META_URI", "cos.secret_key" = "SK_INPUT", "cos.endpoint" = "ENDPOINT", "cos.access_key" = "AK_INPUT" ); + +CREATE CATALOG IF NOT EXISTS iceberg_hms PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.uris" = "META_URI" ); +CREATE CATALOG IF NOT EXISTS iceberg_hms_cos PROPERTIES ( "type" = "iceberg", "iceberg.catalog.type" = "hms", "hive.metastore.uris" = "META_URI", "cos.secret_key" = "SK_INPUT", "cos.endpoint" = "ENDPOINT", "cos.access_key" = "AK_INPUT" ); diff --git a/tools/emr_storage_regression/ping_test/data/create_dlf_ping.sql b/tools/emr_storage_regression/ping_test/data/create_dlf_ping.sql new file mode 100644 index 0000000000..b37b8085bf --- /dev/null +++ b/tools/emr_storage_regression/ping_test/data/create_dlf_ping.sql @@ -0,0 +1,9 @@ +CREATE DATABASE IF NOT EXISTS hive_dlf_db; +CREATE TABLE IF NOT EXISTS hive_dlf_db.types ( hms_int INT, hms_smallint SMALLINT, hms_bigint 
BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredP [...] +CREATE TABLE IF NOT EXISTS hive_dlf_db.types_one_part ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) PARTITIONED BY (dt STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apac [...] +CREATE TABLE IF NOT EXISTS hive_dlf_db.types_multi_part ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN ) PARTITIONED BY (dt STRING, hms_timstamp TIMESTAMP, hms_date DATE) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.ap [...] + +CREATE DATABASE IF NOT EXISTS hive_iceberg_db_dlf; +CREATE TABLE IF NOT EXISTS hive_iceberg_db_dlf.types ( hms_int INT, hms_smallint INT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char STRING, hms_varchar STRING, hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) STORED BY "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler" TBLPROPERTIES ('iceberg.catalog'='dlf', 'format-version'= '2'); +CREATE TABLE IF NOT EXISTS hive_iceberg_db_dlf.types_one_part ( hms_int INT, hms_smallint INT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char STRING, hms_varchar STRING, hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) PARTITIONED BY (dt STRING) STORED BY "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler" TBLPROPERTIES ('iceberg.catalog'='dlf', 'format-version'='2'); +CREATE TABLE IF NOT EXISTS hive_iceberg_db_dlf.types_multi_part ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN ) PARTITIONED BY (dt STRING, hms_timstamp TIMESTAMP, hms_date DATE) STORED BY "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler" TBLPROPERTIES ('iceberg.catalog'='dlf', 'format-version'='2'); diff --git a/tools/emr_storage_regression/ping_test/data/create_hive_ping.sql b/tools/emr_storage_regression/ping_test/data/create_hive_ping.sql new file mode 100644 index 0000000000..54e306f729 --- /dev/null +++ b/tools/emr_storage_regression/ping_test/data/create_hive_ping.sql @@ -0,0 +1,9 @@ +CREATE DATABASE IF NOT EXISTS hive_hms_db; +CREATE TABLE IF NOT EXISTS hive_hms_db.types ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredP [...] +CREATE TABLE IF NOT EXISTS hive_hms_db.types_one_part ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) PARTITIONED BY (dt STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apac [...] +CREATE TABLE IF NOT EXISTS hive_hms_db.types_multi_part ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN ) PARTITIONED BY (dt STRING, hms_timstamp TIMESTAMP, hms_date DATE) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.ap [...] + +CREATE DATABASE IF NOT EXISTS hive_iceberg_db_hms; +CREATE TABLE IF NOT EXISTS hive_iceberg_db_hms.types ( hms_int INT, hms_smallint INT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char STRING, hms_varchar STRING, hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) STORED BY "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler" TBLPROPERTIES ('format-version'='2'); +CREATE TABLE IF NOT EXISTS hive_iceberg_db_hms.types_one_part ( hms_int INT, hms_smallint INT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char STRING, hms_varchar STRING, hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) PARTITIONED BY (dt STRING) STORED BY "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler" TBLPROPERTIES ( 'format-version'='2'); +CREATE TABLE IF NOT EXISTS hive_iceberg_db_hms.types_multi_part ( hms_int INT, hms_smallint INT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char STRING, hms_varchar STRING, hms_bool BOOLEAN ) PARTITIONED BY (dt STRING, hms_timstamp TIMESTAMP, hms_date DATE) STORED BY "org.apache.iceberg.mr.hive.HiveIcebergStorageHandler" TBLPROPERTIES ( 'format-version'='2'); diff --git a/tools/emr_storage_regression/ping_test/data/create_spark_ping.sql b/tools/emr_storage_regression/ping_test/data/create_spark_ping.sql new file mode 100644 index 0000000000..06b9cb6501 --- /dev/null +++ b/tools/emr_storage_regression/ping_test/data/create_spark_ping.sql @@ -0,0 +1,9 @@ +CREATE DATABASE IF NOT EXISTS spark_hms_db; +CREATE TABLE IF NOT EXISTS spark_hms_db.types ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) using parquet; +CREATE TABLE IF NOT EXISTS spark_hms_db.types_one_part ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) using parquet partitioned by (dt string); +CREATE TABLE IF NOT EXISTS spark_hms_db.types_multi_part ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN ) using parquet partitioned by (dt string, hms_timstamp TIMESTAMP, 
hms_date DATE); + +--CREATE DATABASE IF NOT EXISTS iceberg.spark_iceberg_db_hms; +--CREATE TABLE IF NOT EXISTS iceberg.spark_iceberg_db_hms.types ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) USING iceberg TBLPROPERTIES ( 'format-version'='2'); +--CREATE TABLE IF NOT EXISTS iceberg.spark_iceberg_db_hms.types_one_part ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN, hms_timstamp TIMESTAMP, hms_date DATE ) PARTITIONED BY (dt STRING) USING iceberg TBLPROPERTIES ( 'format-version'='2'); +--CREATE TABLE IF NOT EXISTS iceberg.spark_iceberg_db_hms.types_multi_part ( hms_int INT, hms_smallint SMALLINT, hms_bigint BIGINT, hms_double DOUBLE, hms_string STRING, hms_decimal DECIMAL(12,4), hms_char CHAR(50), hms_varchar VARCHAR(50), hms_bool BOOLEAN ) PARTITIONED BY (dt STRING, hms_timstamp TIMESTAMP, hms_date DATE) USING iceberg TBLPROPERTIES ( 'format-version'='2'); \ No newline at end of file diff --git a/tools/emr_storage_regression/ping_test/data/data_for_dlf.sql b/tools/emr_storage_regression/ping_test/data/data_for_dlf.sql new file mode 100644 index 0000000000..cf9574563b --- /dev/null +++ b/tools/emr_storage_regression/ping_test/data/data_for_dlf.sql @@ -0,0 +1,7 @@ +insert into hive_dlf_db.types values(123,34,3455,34.667754,"wastxali",234.1234,"a23f","1234vb",false,"2023-04-23 21:23:34.123","2023-04-23"); +insert into hive_dlf_db.types_one_part values(604,376,234,123.478,"aswwas",234.1234,"a23f","wsd",false,"2023-04-23 21:23:34.123","2023-04-23","2023-04-22"); +insert into hive_dlf_db.types_one_part values(223,22,234,234.500,"awsali",234.1234,"a23f","1234vb",true,"2023-04-22 21:21:34.123","2023-04-21","2023-04-24"); +insert into hive_dlf_db.types_multi_part values(1234,346,234,123.65567,"hwaws",234.1234,"a23f","1234vb",true,"2023-04-20","2023-04-21 19:23:34.123","2023-04-19"); +insert into hive_dlf_db.types_multi_part values(3212343,34,234,123.730,"hwaws",234.1234,"a23f","1234vb",true,"2023-04-20","2023-04-22 20:23:34.123","2023-04-22"); +insert into hive_dlf_db.types_multi_part values(355,22,990,123.324235,"hwaws",234.1234,"a23f","1234vb",true,"2023-04-21","2023-04-22 20:23:34.123","2023-04-22"); +insert into hive_dlf_db.types_multi_part values(23675,22,986,123.324235,"hwaws",234.1234,"a23f","1234vb",true,"2023-04-25","2023-04-21 19:23:34.123","2023-04-24"); \ No newline at end of file diff --git a/tools/emr_storage_regression/ping_test/data/data_for_hive.sql b/tools/emr_storage_regression/ping_test/data/data_for_hive.sql new file mode 100644 index 0000000000..a7599944ef --- /dev/null +++ b/tools/emr_storage_regression/ping_test/data/data_for_hive.sql @@ -0,0 +1,15 @@ +insert into hive_hms_db.types values(1123,5126,51,4534.63463,"wastxali",235.2351,"a23f","1234vb",false,"2023-04-23 21:23:34.123","2023-04-23"); +insert into hive_hms_db.types_one_part values(23621,23,234,345.12512356,"aswwas",525.2352,"a23f","wsd",false,"2023-04-23 21:23:34.123","2023-04-23","2023-04-22"); +insert into hive_hms_db.types_one_part values(11625,62,234,2347.6236,"awsali",546.2342,"a23f","1234vb",true,"2023-04-22 21:21:34.123","2023-04-21","2023-04-24"); +insert into hive_hms_db.types_multi_part values(123,66,234,13.1242,"hwaws",3463.4363,"a23f","1234vb",true,"2023-04-20","2023-04-21 
19:23:34.123","2023-04-19"); +insert into hive_hms_db.types_multi_part values(324,77,234,123.163446,"hwaws",345.3413,"a23f","1234vb",true,"2023-04-20","2023-04-22 20:23:34.123","2023-04-22"); +insert into hive_hms_db.types_multi_part values(423,909,234,123657.512,"hwaws",234.2363,"a23f","1234vb",true,"2023-04-21","2023-04-22 20:23:34.123","2023-04-22"); +insert into hive_hms_db.types_multi_part values(343,712,234,1234.21451,"hwaws",3564.8945,"a23f","1234vb",true,"2023-04-25","2023-04-21 19:23:34.123","2023-04-24"); + +insert into hive_iceberg_db_hms.types values(123,22,234,123.324235,"wsawh",234.1234,"a23f","1234vb",false,"2023-04-23 21:23:34.123","2023-04-23"); +insert into hive_iceberg_db_hms.types_one_part values(223,22,234,123.324235,"aswwas",234.1234,"a23f","wsd",false,"2023-04-23 21:23:34.123","2023-04-23","2023-04-22"); +insert into hive_iceberg_db_hms.types_one_part values(223,22,234,123.324235,"awsali",234.1234,"a23f","1234vb",true,"2023-04-22 21:21:34.123","2023-04-21","2023-04-24"); +insert into hive_iceberg_db_hms.types_multi_part values(323,22,234,123.324235,"hwaws",234.1234,"a23f","1234vb",true,"2023-04-20","2023-04-21 19:23:34.123","2023-04-19"); +insert into hive_iceberg_db_hms.types_multi_part values(323,22,234,123.324235,"hwaws",234.1234,"a23f","1234vb",true,"2023-04-20","2023-04-22 20:23:34.123","2023-04-22"); +insert into hive_iceberg_db_hms.types_multi_part values(323,22,234,123.324235,"hwaws",234.1234,"a23f","1234vb",true,"2023-04-21","2023-04-22 20:23:34.123","2023-04-22"); +insert into hive_iceberg_db_hms.types_multi_part values(323,22,234,123.324235,"hwaws",234.1234,"a23f","1234vb",true,"2023-04-25","2023-04-21 19:23:34.123","2023-04-24"); diff --git a/tools/emr_storage_regression/ping_test/data/data_for_spark.sql b/tools/emr_storage_regression/ping_test/data/data_for_spark.sql new file mode 100644 index 0000000000..a1e2d1e996 --- /dev/null +++ b/tools/emr_storage_regression/ping_test/data/data_for_spark.sql @@ -0,0 +1,10 @@ +insert into spark_hms_db.types values(123,22,234,123.324235,"sawer",234.1234,"a23f","1234vb",false,"2023-04-23 21:23:34.123","2023-04-23"); +insert into spark_hms_db.types_one_part values(223,22,234,123.324235,"sawer",234.1234,"a23f","1234vb",false,"2023-04-23 21:23:34.123","2023-04-23 21:23:34","2023-04-23"); +insert into spark_hms_db.types_one_part values(223,22,234,123.324235,"sawer",234.1234,"a23f","1234vb",false,"2023-04-23 21:23:34.123","2023-04-23 21:23:34","2023-04-23"); +insert into spark_hms_db.types_multi_part values(323,22,234,123.324235,"sawer",234.1234,"a23f","1234vb",true,"2023-04-23","2023-04-23 21:23:34.123","2023-04-23"); +insert into spark_hms_db.types_multi_part values(323,22,234,123.324235,"sawer",234.1234,"a23f","1234vb",true,"2023-04-23","2023-04-23 21:23:34.123","2023-04-23"); +insert into spark_hms_db.types_multi_part values(323,22,234,123.324235,"sawer",234.1234,"a23f","1234vb",true,"2023-04-23","2023-04-23 21:23:34.123","2023-04-23"); + +--insert into iceberg.spark_iceberg_db_hms.types values(123,22,234,123.324235,"sawer",234.1234,"a23f","1234vb",false,"2023-04-23 21:23:34.123","2023-04-23"); +--insert into iceberg.spark_iceberg_db_hms.types_one_part values(223,22,234,123.324235,"sawer",234.1234,"a23f","1234vb",false,"2023-04-23 21:23:34.123","2023-04-23","2023-04-23"); +--insert into iceberg.spark_iceberg_db_hms.types_multi_part values(323,22,234,123.324235,"sawer",234.1234,"a23f","1234vb",true,"2023-04-23","2023-04-23 21:23:34.123","2023-04-23"); diff --git a/tools/emr_storage_regression/ping_test/ping.sql 
b/tools/emr_storage_regression/ping_test/ping.sql new file mode 100644 index 0000000000..01b75a15b2 --- /dev/null +++ b/tools/emr_storage_regression/ping_test/ping.sql @@ -0,0 +1,15 @@ +select * from spark_hms_db.types; +select * from spark_hms_db.types_one_part; +select * from spark_hms_db.types_multi_part; + +select * from hive_hms_db.types; +select * from hive_hms_db.types_one_part; +select * from hive_hms_db.types_multi_part; + +select * from spark_iceberg_db_hms.types; +select * from spark_iceberg_db_hms.types_one_part; +select * from spark_iceberg_db_hms.types_multi_part; + +select * from hive_iceberg_db_hms.types; +select * from hive_iceberg_db_hms.types_one_part; +select * from hive_iceberg_db_hms.types_multi_part; \ No newline at end of file diff --git a/tools/emr_storage_regression/ping_test/ping_aliyun.sql b/tools/emr_storage_regression/ping_test/ping_aliyun.sql new file mode 100644 index 0000000000..63e974a2e6 --- /dev/null +++ b/tools/emr_storage_regression/ping_test/ping_aliyun.sql @@ -0,0 +1,23 @@ +select * from spark_hms_db.types; +select * from spark_hms_db.types_one_part; +select * from spark_hms_db.types_multi_part; + +select * from hive_hms_db.types; +select * from hive_hms_db.types_one_part; +select * from hive_hms_db.types_multi_part; + +select * from spark_iceberg_db_hms.types; +select * from spark_iceberg_db_hms.types_one_part; +select * from spark_iceberg_db_hms.types_multi_part; + +select * from hive_iceberg_db_hms.types; +select * from hive_iceberg_db_hms.types_one_part; +select * from hive_iceberg_db_hms.types_multi_part; + +select * from hive_dlf_db.types; +select * from hive_dlf_db.types_one_part; +select * from hive_dlf_db.types_multi_part; + +select * from hive_iceberg_db_dlf.types; +select * from hive_iceberg_db_dlf.types_one_part; +select * from hive_iceberg_db_dlf.types_multi_part; diff --git a/tools/emr_storage_regression/ping_test/ping_poc.sh b/tools/emr_storage_regression/ping_test/ping_poc.sh new file mode 100644 index 0000000000..e9c23e1581 --- /dev/null +++ b/tools/emr_storage_regression/ping_test/ping_poc.sh @@ -0,0 +1,151 @@ +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +############################################################## +# See emr_tools.sh +############################################################## + +## Step 1: create external table and import data +ENDPOINT=$1 +REGION=$2 +SERVICE=$3 +AK=$4 +SK=$5 +HMS_META_URI=$6 +HMS_WAREHOUSE=$7 +BEELINE_URI=$8 + +# set global env to local +# shellcheck disable=SC2269 +FE_HOST=${FE_HOST} +# shellcheck disable=SC2269 +FE_QUERY_PORT=${FE_QUERY_PORT} +# shellcheck disable=SC2269 +USER=${USER} + +DLF_ENDPOINT=datalake-vpc.cn-beijing.aliyuncs.com +JINDO_ENDPOINT=cn-beijing.oss-dls.aliyuncs.com + +if [[ -z ${HMS_WAREHOUSE} ]]; then + echo "Need warehouse for ${SERVICE}" +fi +cd "$(dirname "$0")" || exit + +run_spark_create_sql() { + if [[ ${SERVICE} == 'ali' ]]; then + PARAM="--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ + --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.iceberg.catalog-impl=org.apache.iceberg.aliyun.dlf.hive.DlfCatalog \ + --conf spark.sql.catalog.iceberg.access.key.id=${AK} \ + --conf spark.sql.catalog.iceberg.access.key.secret=${SK} \ + --conf spark.sql.catalog.iceberg.dlf.endpoint=${DLF_ENDPOINT} \ + --conf spark.sql.catalog.iceberg.dlf.region-id=${REGION} \ + --conf spark.sql.catalog.hms=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.hms.type=hive \ + --conf spark.sql.defaultCatalog=hms \ + --conf spark.sql.catalog.hms.warehouse=${HMS_WAREHOUSE} \ + -f data/create_spark_ping.sql" + elif [[ ${SERVICE} == 'tx' ]]; then + PARAM="--jars /usr/local/service/iceberg/iceberg-spark-runtime-3.2_2.12-0.13.1.jar \ + --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ + --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ + --conf spark.sql.catalog.spark_catalog.type=hive \ + --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.local.type=hadoop \ + --conf spark.sql.catalog.local.warehouse=/usr/hive/warehouse \ + -f data/create_spark_ping.sql" + elif [[ ${SERVICE} == 'hw' ]]; then + PARAM="-f data/create_spark_ping.sql" + else + echo "Unknown service type: ${SERVICE}" + exit 1 + fi + eval spark-sql "${PARAM}" 2>spark_create.log +} + +run_spark_create_sql +run_hive_create_sql() { + if [[ ${SERVICE} == 'hw' ]]; then + beeline -u "${BEELINE_URI}" -f data/create_hive_ping.sql 2>hive_create.log + elif [[ ${SERVICE} == 'ali' ]]; then + hive -f data/create_hive_ping.sql 2>hive_create.log + else + hive -f data/create_hive_ping.sql 2>hive_create.log + fi +} + +run_hive_create_sql + +## Step 2: make ping data +spark-sql -f data/data_for_spark.sql >>spark_data.log +hive -f data/data_for_hive.sql >>hive_data.log + +run_query() { + QUERY_NUM=1 + TRIES=2 + sql_file=$1 + catalog=$2 + while read -r query; do + echo -n "query ${QUERY_NUM}," + if [[ -n ${catalog} ]]; then + query="switch ${catalog};${query}" + fi + for i in $(seq 1 "${TRIES}"); do + RES=$(mysql -vvv -h"${FE_HOST}" -u"${USER}" -P"${FE_QUERY_PORT}" -e "${query}") + echo -n "${RES}" + [[ "${i}" != "${TRIES}" ]] && echo -n "," + done + QUERY_NUM=$((QUERY_NUM + 1)) + done <"${sql_file}" +} + +## Step 3: create external catalog in doris +case "${SERVICE}" in +ali) + sed -e 's#DLF_ENDPOINT#'"${DLF_ENDPOINT}"'#g' -e 's#JINDO_ENDPOINT#'"${JINDO_ENDPOINT}"'#g' -e 
's#ENDPOINT#'"${ENDPOINT}"'#g' -e 's#META_URI#'"${HMS_META_URI}"'#g' -e 's#AK_INPUT#'"${AK}"'#g' -e 's#SK_INPUT#'"${SK}"'#g' create_catalog_aliyun.sql >emr_catalog.sql + ;; +tx) + sed -e 's#ENDPOINT#'"${ENDPOINT}"'#g' -e 's#META_URI#'"${HMS_META_URI}"'#g' -e 's#AK_INPUT#'"${AK}"'#g' -e 's#SK_INPUT#'"${SK}"'#g' create_catalog_tx.sql >emr_catalog.sql + ;; +aws) + sed -e 's#ENDPOINT#'"${ENDPOINT}"'#g' -e 's#META_URI#'"${HMS_META_URI}"'#g' -e 's#AK_INPUT#'"${AK}"'#g' -e 's#SK_INPUT#'"${SK}"'#g' create_catalog_aws.sql >emr_catalog.sql + ;; +hw) + sed -e 's#ENDPOINT#'"${ENDPOINT}"'#g' -e 's#META_URI#'"${HMS_META_URI}"'#g' -e 's#AK_INPUT#'"${AK}"'#g' -e 's#SK_INPUT#'"${SK}"'#g' create_catalog_hw.sql >emr_catalog.sql + ;; +*) + echo "Internal error" + exit 1 + ;; +esac + +run_query emr_catalog.sql + +## Step 4: query ping +EMR_CATALOG=$(awk '{print $6}' emr_catalog.sql) +# shellcheck disable=SC2116 +# required echo here, or the EMR_CATALOG will not be split. +for c in $(echo "${EMR_CATALOG}"); do + if [[ ${SERVICE} == 'ali' ]]; then + run_query ping_aliyun.sql "${c}" + fi + run_query ping.sql "${c}" +done diff --git a/tools/emr_storage_regression/standard_set/analyze.sql b/tools/emr_storage_regression/standard_set/analyze.sql new file mode 100644 index 0000000000..7a0e4ef746 --- /dev/null +++ b/tools/emr_storage_regression/standard_set/analyze.sql @@ -0,0 +1,69 @@ +analyze table ssb100_parquet.customer; +analyze table ssb100_parquet.dates; +analyze table ssb100_parquet.lineorder; +analyze table ssb100_parquet.lineorder_flat; +analyze table ssb100_parquet.part; +analyze table ssb100_parquet.supplier; + +analyze table ssb100_orc.customer; +analyze table ssb100_orc.dates; +analyze table ssb100_orc.lineorder; +analyze table ssb100_orc.lineorder_flat; +analyze table ssb100_orc.part; +analyze table ssb100_orc.supplier; + +analyze table tpch100_parquet.customer; +analyze table tpch100_parquet.lineitem; +analyze table tpch100_parquet.nation; +analyze table tpch100_parquet.orders; +analyze table tpch100_parquet.part; +analyze table tpch100_parquet.partsupp; +analyze table tpch100_parquet.region; +analyze table tpch100_parquet.supplier; + +analyze table tpch100_orc.customer; +analyze table tpch100_orc.lineitem; +analyze table tpch100_orc.nation; +analyze table tpch100_orc.orders; +analyze table tpch100_orc.part; +analyze table tpch100_orc.partsupp; +analyze table tpch100_orc.region; +analyze table tpch100_orc.supplier; + +analyze table clickbench_orc.hits; +analyze table clickbench_parquet.hits; + +analyze table ssb100_parquet_hdfs.customer; +analyze table ssb100_parquet_hdfs.dates; +analyze table ssb100_parquet_hdfs.lineorder; +analyze table ssb100_parquet_hdfs.lineorder_flat; +analyze table ssb100_parquet_hdfs.part; +analyze table ssb100_parquet_hdfs.supplier; + +analyze table ssb100_orc_hdfs.customer; +analyze table ssb100_orc_hdfs.dates; +analyze table ssb100_orc_hdfs.lineorder; +analyze table ssb100_orc_hdfs.lineorder_flat; +analyze table ssb100_orc_hdfs.part; +analyze table ssb100_orc_hdfs.supplier; + +analyze table tpch100_orc_hdfs.customer; +analyze table tpch100_orc_hdfs.lineitem; +analyze table tpch100_orc_hdfs.nation; +analyze table tpch100_orc_hdfs.orders; +analyze table tpch100_orc_hdfs.part; +analyze table tpch100_orc_hdfs.partsupp; +analyze table tpch100_orc_hdfs.region; +analyze table tpch100_orc_hdfs.supplier; + +analyze table tpch100_parquet_hdfs.customer; +analyze table tpch100_parquet_hdfs.lineitem; +analyze table tpch100_parquet_hdfs.nation; +analyze table tpch100_parquet_hdfs.orders; 
+analyze table tpch100_parquet_hdfs.part; +analyze table tpch100_parquet_hdfs.partsupp; +analyze table tpch100_parquet_hdfs.region; +analyze table tpch100_parquet_hdfs.supplier; + +analyze table clickbench_hdfs_orc.hits; +analyze table clickbench_hdfs_parquet.hits; diff --git a/tools/emr_storage_regression/standard_set/gen_spark_create_sql.sh b/tools/emr_storage_regression/standard_set/gen_spark_create_sql.sh new file mode 100644 index 0000000000..e70e67e43e --- /dev/null +++ b/tools/emr_storage_regression/standard_set/gen_spark_create_sql.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +############################################################## +# See emr_tools.sh +############################################################## + +# shellcheck disable=SC2129 +BUCKET=$1 +TYPE=$2 +cd "$(dirname "$0")" || exit +sh gen_tbl/gen_ssb_create_sql.sh "${BUCKET}"/ssb/ssb100_orc ssb100_orc_"${TYPE}" orc >create_"${TYPE}".sql +sh gen_tbl/gen_ssb_create_sql.sh "${BUCKET}"/ssb/ssb100_parquet ssb100_parquet_"${TYPE}" parquet >>create_"${TYPE}".sql +# tpch +sh gen_tbl/gen_tpch_create_sql.sh "${BUCKET}"/tpch/tpch100_orc tpch100_orc_"${TYPE}" orc >>create_"${TYPE}".sql +sh gen_tbl/gen_tpch_create_sql.sh "${BUCKET}"/tpch/tpch100_parquet tpch100_parquet_"${TYPE}" parquet >>create_"${TYPE}".sql +# clickbench +sh gen_tbl/gen_clickbench_create_sql.sh "${BUCKET}"/clickbench/hits_parquet clickbench_parquet_"${TYPE}" parquet >>create_"${TYPE}".sql +sh gen_tbl/gen_clickbench_create_sql.sh "${BUCKET}"/clickbench/hits_orc clickbench_orc_"${TYPE}" orc >>create_"${TYPE}".sql +# iceberg +# sh gen_tbl/gen_ssb_create_sql.sh oss://benchmark-oss/ssb/ssb100_iceberg ssb100_iceberg iceberg >> create_"${TYPE}".sql +# sh gen_tbl/gen_tpch_create_sql.sh oss://benchmark-oss/tpch/tpch100_iceberg tpch100_iceberg iceberg >> create_"${TYPE}".sql +# sh gen_tbl/gen_clickbench_create_sql.sh oss://benchmark-oss/clickbench/hits_iceberg clickbench_iceberg_hdfs >> create_"${TYPE}".sql diff --git a/tools/emr_storage_regression/standard_set/gen_tbl/gen_clickbench_create_sql.sh b/tools/emr_storage_regression/standard_set/gen_tbl/gen_clickbench_create_sql.sh new file mode 100644 index 0000000000..90e57624ed --- /dev/null +++ b/tools/emr_storage_regression/standard_set/gen_tbl/gen_clickbench_create_sql.sh @@ -0,0 +1,153 @@ +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +############################################################## +# See emr_tools.sh +############################################################## + +if [[ -z "$1" ]]; then + echo 'the first argument is database location' + exit +else + db_loc=$1 +fi + +if [[ -z "$2" ]]; then + echo 'the second argument is database name' + exit +else + db=$2 +fi +if [[ -z "$3" ]]; then + format=parquet +else + format=$3 +fi +# shellcheck disable=SC2016 +echo ' +CREATE DATABASE IF NOT EXISTS '"${db}"'; +USE '"${db}"'; +CREATE TABLE IF NOT EXISTS `hits`( + `WatchID` BIGINT, + `JavaEnable` SMALLINT, + `Title` STRING, + `GoodEvent` SMALLINT, + `EventTime` TIMESTAMP, + `EventDate` DATE, + `CounterID` INT, + `ClientIP` INT, + `RegionID` INT, + `UserID` BIGINT, + `CounterClass` SMALLINT, + `OS` SMALLINT, + `UserAgent` SMALLINT, + `URL` STRING, + `Referer` STRING, + `IsRefresh` SMALLINT, + `RefererCategoryID` SMALLINT, + `RefererRegionID` INT, + `URLCategoryID` SMALLINT, + `URLRegionID` INT, + `ResolutionWidth` SMALLINT, + `ResolutionHeight` SMALLINT, + `ResolutionDepth` SMALLINT, + `FlashMajor` SMALLINT, + `FlashMinor` SMALLINT, + `FlashMinor2` STRING, + `NetMajor` SMALLINT, + `NetMinor` SMALLINT, + `UserAgentMajor` SMALLINT, + `UserAgentMinor` STRING, + `CookieEnable` SMALLINT, + `JavascriptEnable` SMALLINT, + `IsMobile` SMALLINT, + `MobilePhone` SMALLINT, + `MobilePhoneModel` STRING, + `Params` STRING, + `IPNetworkID` INT, + `TraficSourceID` SMALLINT, + `SearchEngineID` SMALLINT, + `SearchPhrase` STRING, + `AdvEngineID` SMALLINT, + `IsArtifical` SMALLINT, + `WindowClientWidth` SMALLINT, + `WindowClientHeight` SMALLINT, + `ClientTimeZone` SMALLINT, + `ClientEventTime` TIMESTAMP, + `SilverlightVersion1` SMALLINT, + `SilverlightVersion2` SMALLINT, + `SilverlightVersion3` INT, + `SilverlightVersion4` SMALLINT, + `PageCharset` STRING, + `CodeVersion` INT, + `IsLink` SMALLINT, + `IsDownload` SMALLINT, + `IsNotBounce` SMALLINT, + `FUniqID` BIGINT, + `OriginalURL` STRING, + `HID` INT, + `IsOldCounter` SMALLINT, + `IsEvent` SMALLINT, + `IsParameter` SMALLINT, + `DontCountHits` SMALLINT, + `WithHash` SMALLINT, + `HitColor` STRING, + `LocalEventTime` TIMESTAMP, + `Age` SMALLINT, + `Sex` SMALLINT, + `Income` SMALLINT, + `Interests` SMALLINT, + `Robotness` SMALLINT, + `RemoteIP` INT, + `WindowName` INT, + `OpenerName` INT, + `HistoryLength` SMALLINT, + `BrowserLanguage` STRING, + `BrowserCountry` STRING, + `SocialNetwork` STRING, + `SocialAction` STRING, + `HTTPError` SMALLINT, + `SendTiming` INT, + `DNSTiming` INT, + `ConnectTiming` INT, + `ResponseStartTiming` INT, + `ResponseEndTiming` INT, + `FetchTiming` INT, + `SocialSourceNetworkID` SMALLINT, + `SocialSourcePage` STRING, + `ParamPrice` BIGINT, + `ParamOrderID` STRING, + `ParamCurrency` STRING, + `ParamCurrencyID` SMALLINT, + `OpenstatServiceName` STRING, + `OpenstatCampaignID` STRING, + `OpenstatAdID` STRING, + `OpenstatSourceID` STRING, + `UTMSource` STRING, + `UTMMedium` STRING, + `UTMCampaign` STRING, + `UTMContent` STRING, + `UTMTerm` STRING, + `FromTag` STRING, + `HasGCLID` SMALLINT, + `RefererHash` BIGINT, + `URLHash` BIGINT, + `CLID` INT) +USING 
'"${format}"' +LOCATION "'"${db_loc}"'"; +' diff --git a/tools/emr_storage_regression/standard_set/gen_tbl/gen_ssb_create_sql.sh b/tools/emr_storage_regression/standard_set/gen_tbl/gen_ssb_create_sql.sh new file mode 100644 index 0000000000..ae4b993372 --- /dev/null +++ b/tools/emr_storage_regression/standard_set/gen_tbl/gen_ssb_create_sql.sh @@ -0,0 +1,166 @@ +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +############################################################## +# See emr_tools.sh +############################################################## + +if [[ -z "$1" ]]; then + echo 'the first argument is database location' + exit +else + db_loc=$1 +fi + +if [[ -z "$2" ]]; then + echo 'the second argument is database name' + exit +else + db=$2 +fi + +if [[ -z "$3" ]]; then + format=parquet +else + format=$3 +fi +# shellcheck disable=SC2016 +echo ' +CREATE DATABASE IF NOT EXISTS '"${db}"'; +USE '"${db}"'; + +CREATE TABLE IF NOT EXISTS `customer`( + `c_custkey` BIGINT COMMENT "", + `c_name` VARCHAR(26) COMMENT "", + `c_address` VARCHAR(41) COMMENT "", + `c_city` VARCHAR(11) COMMENT "", + `c_nation` VARCHAR(16) COMMENT "", + `c_region` VARCHAR(13) COMMENT "", + `c_phone` VARCHAR(16) COMMENT "", + `c_mktsegment` VARCHAR(11) COMMENT "") +USING '"${format}"' +LOCATION "'"${db_loc}"/customer'"; + +CREATE TABLE IF NOT EXISTS `dates`( + `d_datekey` BIGINT COMMENT "", + `d_date` VARCHAR(20) COMMENT "", + `d_dayofweek` VARCHAR(10) COMMENT "", + `d_month` VARCHAR(11) COMMENT "", + `d_year` BIGINT COMMENT "", + `d_yearmonthnum` BIGINT COMMENT "", + `d_yearmonth` VARCHAR(9) COMMENT "", + `d_daynuminweek` BIGINT COMMENT "", + `d_daynuminmonth` BIGINT COMMENT "", + `d_daynuminyear` BIGINT COMMENT "", + `d_monthnuminyear` BIGINT COMMENT "", + `d_weeknuminyear` BIGINT COMMENT "", + `d_sellingseason` VARCHAR(14) COMMENT "", + `d_lastdayinweekfl` BIGINT COMMENT "", + `d_lastdayinmonthfl` BIGINT COMMENT "", + `d_holidayfl` BIGINT COMMENT "", + `d_weekdayfl` BIGINT COMMENT "") +USING '"${format}"' +LOCATION "'"${db_loc}"/dates'"; + +CREATE TABLE IF NOT EXISTS `lineorder`( + `lo_orderkey` BIGINT COMMENT "", + `lo_linenumber` BIGINT COMMENT "", + `lo_custkey` BIGINT COMMENT "", + `lo_partkey` BIGINT COMMENT "", + `lo_suppkey` BIGINT COMMENT "", + `lo_orderdate` BIGINT COMMENT "", + `lo_orderpriority` VARCHAR(16) COMMENT "", + `lo_shippriority` BIGINT COMMENT "", + `lo_quantity` BIGINT COMMENT "", + `lo_extendedprice` BIGINT COMMENT "", + `lo_ordtotalprice` BIGINT COMMENT "", + `lo_discount` BIGINT COMMENT "", + `lo_revenue` BIGINT COMMENT "", + `lo_supplycost` BIGINT COMMENT "", + `lo_tax` BIGINT COMMENT "", + `lo_commitdate` BIGINT COMMENT "", + `lo_shipmode` VARCHAR(11) COMMENT "") +USING '"${format}"' +LOCATION "'"${db_loc}"/lineorder'"; + +CREATE TABLE 
IF NOT EXISTS `part`( + `p_partkey` BIGINT COMMENT "", + `p_name` VARCHAR(23) COMMENT "", + `p_mfgr` VARCHAR(7) COMMENT "", + `p_category` VARCHAR(8) COMMENT "", + `p_brand` VARCHAR(10) COMMENT "", + `p_color` VARCHAR(12) COMMENT "", + `p_type` VARCHAR(26) COMMENT "", + `p_size` BIGINT COMMENT "", + `p_container` VARCHAR(11) COMMENT "") +USING '"${format}"' +LOCATION "'"${db_loc}"/part'"; + +CREATE TABLE IF NOT EXISTS `supplier`( + `s_suppkey` BIGINT COMMENT "", + `s_name` VARCHAR(26) COMMENT "", + `s_address` VARCHAR(26) COMMENT "", + `s_city` VARCHAR(11) COMMENT "", + `s_nation` VARCHAR(16) COMMENT "", + `s_region` VARCHAR(13) COMMENT "", + `s_phone` VARCHAR(16) COMMENT "") +USING '"${format}"' +LOCATION "'"${db_loc}"/supplier'"; + +CREATE TABLE IF NOT EXISTS `lineorder_flat` ( + `lo_orderdate` BIGINT COMMENT "", + `lo_orderkey` BIGINT COMMENT "", + `lo_linenumber` TINYINT COMMENT "", + `lo_custkey` BIGINT COMMENT "", + `lo_partkey` BIGINT COMMENT "", + `lo_suppkey` BIGINT COMMENT "", + `lo_orderpriority` VARCHAR(100) COMMENT "", + `lo_shippriority` TINYINT COMMENT "", + `lo_quantity` TINYINT COMMENT "", + `lo_extendedprice` BIGINT COMMENT "", + `lo_ordtotalprice` BIGINT COMMENT "", + `lo_discount` TINYINT COMMENT "", + `lo_revenue` BIGINT COMMENT "", + `lo_supplycost` BIGINT COMMENT "", + `lo_tax` TINYINT COMMENT "", + `lo_commitdate` BIGINT COMMENT "", + `lo_shipmode` VARCHAR(100) COMMENT "", + `c_name` VARCHAR(100) COMMENT "", + `c_address` VARCHAR(100) COMMENT "", + `c_city` VARCHAR(100) COMMENT "", + `c_nation` VARCHAR(100) COMMENT "", + `c_region` VARCHAR(100) COMMENT "", + `c_phone` VARCHAR(100) COMMENT "", + `c_mktsegment` VARCHAR(100) COMMENT "", + `s_name` VARCHAR(100) COMMENT "", + `s_address` VARCHAR(100) COMMENT "", + `s_city` VARCHAR(100) COMMENT "", + `s_nation` VARCHAR(100) COMMENT "", + `s_region` VARCHAR(100) COMMENT "", + `s_phone` VARCHAR(100) COMMENT "", + `p_name` VARCHAR(100) COMMENT "", + `p_mfgr` VARCHAR(100) COMMENT "", + `p_category` VARCHAR(100) COMMENT "", + `p_brand` VARCHAR(100) COMMENT "", + `p_color` VARCHAR(100) COMMENT "", + `p_type` VARCHAR(100) COMMENT "", + `p_size` TINYINT COMMENT "", + `p_container` VARCHAR(100) COMMENT "") +USING '"${format}"' +LOCATION "'"${db_loc}"/lineorder_flat'"; +' diff --git a/tools/emr_storage_regression/standard_set/gen_tbl/gen_tpch_create_sql.sh b/tools/emr_storage_regression/standard_set/gen_tbl/gen_tpch_create_sql.sh new file mode 100644 index 0000000000..7905d2a940 --- /dev/null +++ b/tools/emr_storage_regression/standard_set/gen_tbl/gen_tpch_create_sql.sh @@ -0,0 +1,140 @@ +#!/usr/bin/env bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
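For context, a minimal usage sketch: the OSS path, database name, and `spark-sql` invocation below are illustrative assumptions, not part of the commit; in real runs `emr_tools.sh` drives this generator.

```
# Emit SSB DDL for a hypothetical bucket, then apply it with the spark-sql CLI.
sh gen_tbl/gen_ssb_create_sql.sh oss://my-bucket/ssb100 ssb100_parquet_obj parquet > create_ssb.sql
spark-sql -f create_ssb.sql
```
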
diff --git a/tools/emr_storage_regression/standard_set/gen_tbl/gen_tpch_create_sql.sh b/tools/emr_storage_regression/standard_set/gen_tbl/gen_tpch_create_sql.sh
new file mode 100644
index 0000000000..7905d2a940
--- /dev/null
+++ b/tools/emr_storage_regression/standard_set/gen_tbl/gen_tpch_create_sql.sh
@@ -0,0 +1,140 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+##############################################################
+# See emr_tools.sh
+##############################################################
+
+if [[ -z "$1" ]]; then
+    echo 'the first argument is database location'
+    exit 1
+else
+    db_loc=$1
+fi
+
+if [[ -z "$2" ]]; then
+    echo 'the second argument is database name'
+    exit 1
+else
+    db=$2
+fi
+
+if [[ -z "$3" ]]; then
+    format=parquet
+else
+    format=$3
+fi
+# shellcheck disable=SC2016
+echo '
+CREATE DATABASE IF NOT EXISTS '"${db}"';
+USE '"${db}"';
+
+CREATE TABLE IF NOT EXISTS `customer`(
+  `c_custkey` int,
+  `c_name` string,
+  `c_address` string,
+  `c_nationkey` int,
+  `c_phone` string,
+  `c_acctbal` decimal(12,2),
+  `c_mktsegment` string,
+  `c_comment` string)
+USING '"${format}"'
+LOCATION "'"${db_loc}"/customer'";
+
+CREATE TABLE IF NOT EXISTS `lineitem`(
+  `l_orderkey` int,
+  `l_partkey` int,
+  `l_suppkey` int,
+  `l_linenumber` int,
+  `l_quantity` decimal(12,2),
+  `l_extendedprice` decimal(12,2),
+  `l_discount` decimal(12,2),
+  `l_tax` decimal(12,2),
+  `l_returnflag` string,
+  `l_linestatus` string,
+  `l_shipdate` date,
+  `l_commitdate` date,
+  `l_receiptdate` date,
+  `l_shipinstruct` string,
+  `l_shipmode` string,
+  `l_comment` string)
+USING '"${format}"'
+LOCATION "'"${db_loc}"/lineitem'";
+
+CREATE TABLE IF NOT EXISTS `nation`(
+  `n_nationkey` int,
+  `n_name` string,
+  `n_regionkey` int,
+  `n_comment` string)
+USING '"${format}"'
+LOCATION "'"${db_loc}"/nation'";
+
+CREATE TABLE IF NOT EXISTS `orders`(
+  `o_orderkey` int,
+  `o_custkey` int,
+  `o_orderstatus` string,
+  `o_totalprice` decimal(12,2),
+  `o_orderdate` date,
+  `o_orderpriority` string,
+  `o_clerk` string,
+  `o_shippriority` int,
+  `o_comment` string)
+USING '"${format}"'
+LOCATION "'"${db_loc}"/orders'";
+
+CREATE TABLE IF NOT EXISTS `part`(
+  `p_partkey` int,
+  `p_name` string,
+  `p_mfgr` string,
+  `p_brand` string,
+  `p_type` string,
+  `p_size` int,
+  `p_container` string,
+  `p_retailprice` decimal(12,2),
+  `p_comment` string)
+USING '"${format}"'
+LOCATION "'"${db_loc}"/part'";
+
+CREATE TABLE IF NOT EXISTS `partsupp`(
+  `ps_partkey` int,
+  `ps_suppkey` int,
+  `ps_availqty` int,
+  `ps_supplycost` decimal(12,2),
+  `ps_comment` string)
+USING '"${format}"'
+LOCATION "'"${db_loc}"/partsupp'";
+
+CREATE TABLE IF NOT EXISTS `region` (
+  `r_regionkey` int,
+  `r_name` string,
+  `r_comment` string)
+USING '"${format}"'
+LOCATION "'"${db_loc}"/region'";
+
+CREATE TABLE IF NOT EXISTS `supplier`(
+  `s_suppkey` int,
+  `s_name` string,
+  `s_address` string,
+  `s_nationkey` int,
+  `s_phone` string,
+  `s_acctbal` decimal(12,2),
+  `s_comment` string)
+USING '"${format}"'
+LOCATION "'"${db_loc}"/supplier'";
+
+'
diff --git a/tools/emr_storage_regression/standard_set/queries/clickbench_queries.sql b/tools/emr_storage_regression/standard_set/queries/clickbench_queries.sql
new file mode 100644
index 0000000000..28dbd29dbd
--- /dev/null
+++ b/tools/emr_storage_regression/standard_set/queries/clickbench_queries.sql
@@ -0,0 +1,43 @@
+SELECT COUNT(*) FROM hits;
+SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0;
+SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;
+SELECT AVG(UserID) FROM hits;
+SELECT COUNT(DISTINCT UserID) FROM hits;
+SELECT COUNT(DISTINCT SearchPhrase) FROM hits;
+SELECT MIN(EventDate), MAX(EventDate) FROM hits;
+SELECT AdvEngineID, COUNT(*) FROM hits WHERE AdvEngineID <> 0 GROUP BY AdvEngineID ORDER BY COUNT(*) DESC;
+SELECT RegionID, COUNT(DISTINCT UserID) AS u FROM hits GROUP BY RegionID ORDER BY u DESC LIMIT 10;
+SELECT RegionID, SUM(AdvEngineID), COUNT(*) AS c, AVG(ResolutionWidth), COUNT(DISTINCT UserID) FROM hits GROUP BY RegionID ORDER BY c DESC LIMIT 10;
+SELECT MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhoneModel ORDER BY u DESC LIMIT 10;
+SELECT MobilePhone, MobilePhoneModel, COUNT(DISTINCT UserID) AS u FROM hits WHERE MobilePhoneModel <> '' GROUP BY MobilePhone, MobilePhoneModel ORDER BY u DESC LIMIT 10;
+SELECT SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
+SELECT SearchPhrase, COUNT(DISTINCT UserID) AS u FROM hits WHERE SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY u DESC LIMIT 10;
+SELECT SearchEngineID, SearchPhrase, COUNT(*) AS c FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, SearchPhrase ORDER BY c DESC LIMIT 10;
+SELECT UserID, COUNT(*) FROM hits GROUP BY UserID ORDER BY COUNT(*) DESC LIMIT 10;
+SELECT UserID, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, SearchPhrase ORDER BY COUNT(*) DESC LIMIT 10;
+SELECT UserID, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, SearchPhrase LIMIT 10;
+SELECT UserID, extract(minute FROM EventTime) AS m, SearchPhrase, COUNT(*) FROM hits GROUP BY UserID, m, SearchPhrase ORDER BY COUNT(*) DESC LIMIT 10;
+SELECT UserID FROM hits WHERE UserID = 435090932899640449;
+SELECT COUNT(*) FROM hits WHERE URL LIKE '%google%';
+SELECT SearchPhrase, MIN(URL), COUNT(*) AS c FROM hits WHERE URL LIKE '%google%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
+SELECT SearchPhrase, MIN(URL), MIN(Title), COUNT(*) AS c, COUNT(DISTINCT UserID) FROM hits WHERE Title LIKE '%Google%' AND URL NOT LIKE '%.google.%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;
+SELECT * FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;
+SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY EventTime LIMIT 10;
+SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY SearchPhrase LIMIT 10;
+SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY EventTime, SearchPhrase LIMIT 10;
+SELECT CounterID, AVG(length(URL)) AS l, COUNT(*) AS c FROM hits WHERE URL <> '' GROUP BY CounterID HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
+SELECT REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM hits WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
+SELECT SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2), SUM(ResolutionWidth + 3), SUM(ResolutionWidth + 4), SUM(ResolutionWidth + 5), SUM(ResolutionWidth + 6), SUM(ResolutionWidth + 7), SUM(ResolutionWidth + 8), SUM(ResolutionWidth + 9), SUM(ResolutionWidth + 10), SUM(ResolutionWidth + 11), SUM(ResolutionWidth + 12), SUM(ResolutionWidth + 13), SUM(ResolutionWidth + 14), SUM(ResolutionWidth + 15), SUM(ResolutionWidth + 16), SUM(ResolutionWidth + 17), SUM(Resolutio [...]
+SELECT SearchEngineID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase <> '' GROUP BY SearchEngineID, ClientIP ORDER BY c DESC LIMIT 10;
+SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits WHERE SearchPhrase <> '' GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
+SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
+SELECT URL, COUNT(*) AS c FROM hits GROUP BY URL ORDER BY c DESC LIMIT 10;
+SELECT 1, URL, COUNT(*) AS c FROM hits GROUP BY 1, URL ORDER BY c DESC LIMIT 10;
+SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, COUNT(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10;
+SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND URL <> '' GROUP BY URL ORDER BY PageViews DESC LIMIT 10;
+SELECT Title, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND Title <> '' GROUP BY Title ORDER BY PageViews DESC LIMIT 10;
+SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND IsLink <> 0 AND IsDownload = 0 GROUP BY URL ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
+SELECT TraficSourceID, SearchEngineID, AdvEngineID, CASE WHEN (SearchEngineID = 0 AND AdvEngineID = 0) THEN Referer ELSE '' END AS Src, URL AS Dst, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 GROUP BY TraficSourceID, SearchEngineID, AdvEngineID, Src, Dst ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;
+SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND TraficSourceID IN (-1, 6) AND RefererHash = 3594120000172545465 GROUP BY URLHash, EventDate ORDER BY PageViews DESC LIMIT 10 OFFSET 100;
+SELECT WindowClientWidth, WindowClientHeight, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND IsRefresh = 0 AND DontCountHits = 0 AND URLHash = 2868770270353813622 GROUP BY WindowClientWidth, WindowClientHeight ORDER BY PageViews DESC LIMIT 10 OFFSET 10000;
+SELECT DATE_FORMAT(EventTime, '%Y-%m-%d %H:%i:00') AS M, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-14' AND EventDate <= '2013-07-15' AND IsRefresh = 0 AND DontCountHits = 0 GROUP BY DATE_FORMAT(EventTime, '%Y-%m-%d %H:%i:00') ORDER BY DATE_FORMAT(EventTime, '%Y-%m-%d %H:%i:00') LIMIT 10 OFFSET 1000;
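For context: run_queries.sh (added further below) feeds these files to the MySQL client with `while read -r query`, so each benchmark statement must stay on one physical line. A minimal sketch of that consumption loop, against a hypothetical FE endpoint:

```
# Sketch only: execute each single-line query against a hypothetical Doris FE.
while read -r query; do
    mysql -h127.0.0.1 -P9030 -uroot -e "${query}"
done < queries/clickbench_queries.sql
```
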
diff --git a/tools/emr_storage_regression/standard_set/queries/ssb_flat_queries.sql b/tools/emr_storage_regression/standard_set/queries/ssb_flat_queries.sql
new file mode 100644
index 0000000000..52c249a7f0
--- /dev/null
+++ b/tools/emr_storage_regression/standard_set/queries/ssb_flat_queries.sql
@@ -0,0 +1,13 @@
+SELECT SUM(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue FROM lineorder_flat WHERE LO_ORDERDATE >= 19930101 AND LO_ORDERDATE <= 19931231 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;
+SELECT SUM(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue FROM lineorder_flat WHERE LO_ORDERDATE >= 19940101 AND LO_ORDERDATE <= 19940131 AND LO_DISCOUNT BETWEEN 4 AND 6 AND LO_QUANTITY BETWEEN 26 AND 35;
+SELECT SUM(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue FROM lineorder_flat WHERE weekofyear(LO_ORDERDATE) = 6 AND LO_ORDERDATE >= 19940101 AND LO_ORDERDATE <= 19941231 AND LO_DISCOUNT BETWEEN 5 AND 7 AND LO_QUANTITY BETWEEN 26 AND 35;
+SELECT SUM(LO_REVENUE), (LO_ORDERDATE DIV 10000) AS YEAR, P_BRAND FROM lineorder_flat WHERE P_CATEGORY = 'MFGR#12' AND S_REGION = 'AMERICA' GROUP BY YEAR, P_BRAND ORDER BY YEAR, P_BRAND;
+SELECT SUM(LO_REVENUE), (LO_ORDERDATE DIV 10000) AS YEAR, P_BRAND FROM lineorder_flat WHERE P_BRAND >= 'MFGR#2221' AND P_BRAND <= 'MFGR#2228' AND S_REGION = 'ASIA' GROUP BY YEAR, P_BRAND ORDER BY YEAR, P_BRAND;
+SELECT SUM(LO_REVENUE), (LO_ORDERDATE DIV 10000) AS YEAR, P_BRAND FROM lineorder_flat WHERE P_BRAND = 'MFGR#2239' AND S_REGION = 'EUROPE' GROUP BY YEAR, P_BRAND ORDER BY YEAR, P_BRAND;
+SELECT C_NATION, S_NATION, (LO_ORDERDATE DIV 10000) AS YEAR, SUM(LO_REVENUE) AS revenue FROM lineorder_flat WHERE C_REGION = 'ASIA' AND S_REGION = 'ASIA' AND LO_ORDERDATE >= 19920101 AND LO_ORDERDATE <= 19971231 GROUP BY C_NATION, S_NATION, YEAR ORDER BY YEAR ASC, revenue DESC;
+SELECT C_CITY, S_CITY, (LO_ORDERDATE DIV 10000) AS YEAR, SUM(LO_REVENUE) AS revenue FROM lineorder_flat WHERE C_NATION = 'UNITED STATES' AND S_NATION = 'UNITED STATES' AND LO_ORDERDATE >= 19920101 AND LO_ORDERDATE <= 19971231 GROUP BY C_CITY, S_CITY, YEAR ORDER BY YEAR ASC, revenue DESC;
+SELECT C_CITY, S_CITY, (LO_ORDERDATE DIV 10000) AS YEAR, SUM(LO_REVENUE) AS revenue FROM lineorder_flat WHERE C_CITY IN ('UNITED KI1', 'UNITED KI5') AND S_CITY IN ('UNITED KI1', 'UNITED KI5') AND LO_ORDERDATE >= 19920101 AND LO_ORDERDATE <= 19971231 GROUP BY C_CITY, S_CITY, YEAR ORDER BY YEAR ASC, revenue DESC;
+SELECT C_CITY, S_CITY, (LO_ORDERDATE DIV 10000) AS YEAR, SUM(LO_REVENUE) AS revenue FROM lineorder_flat WHERE C_CITY IN ('UNITED KI1', 'UNITED KI5') AND S_CITY IN ('UNITED KI1', 'UNITED KI5') AND LO_ORDERDATE >= 19971201 AND LO_ORDERDATE <= 19971231 GROUP BY C_CITY, S_CITY, YEAR ORDER BY YEAR ASC, revenue DESC;
+SELECT (LO_ORDERDATE DIV 10000) AS YEAR, C_NATION, SUM(LO_REVENUE - LO_SUPPLYCOST) AS profit FROM lineorder_flat WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND P_MFGR IN ('MFGR#1', 'MFGR#2') GROUP BY YEAR, C_NATION ORDER BY YEAR ASC, C_NATION ASC;
+SELECT (LO_ORDERDATE DIV 10000) AS YEAR, S_NATION, P_CATEGORY, SUM(LO_REVENUE - LO_SUPPLYCOST) AS profit FROM lineorder_flat WHERE C_REGION = 'AMERICA' AND S_REGION = 'AMERICA' AND LO_ORDERDATE >= 19970101 AND LO_ORDERDATE <= 19981231 AND P_MFGR IN ('MFGR#1', 'MFGR#2') GROUP BY YEAR, S_NATION, P_CATEGORY ORDER BY YEAR ASC, S_NATION ASC, P_CATEGORY ASC;
+SELECT (LO_ORDERDATE DIV 10000) AS YEAR, S_CITY, P_BRAND, SUM(LO_REVENUE - LO_SUPPLYCOST) AS profit FROM lineorder_flat WHERE S_NATION = 'UNITED STATES' AND LO_ORDERDATE >= 19970101 AND LO_ORDERDATE <= 19981231 AND P_CATEGORY = 'MFGR#14' GROUP BY YEAR, S_CITY, P_BRAND ORDER BY YEAR ASC, S_CITY ASC, P_BRAND ASC;
diff --git a/tools/emr_storage_regression/standard_set/queries/ssb_queries.sql b/tools/emr_storage_regression/standard_set/queries/ssb_queries.sql
new file mode 100644
index 0000000000..8b5e336593
--- /dev/null
+++ b/tools/emr_storage_regression/standard_set/queries/ssb_queries.sql
@@ -0,0 +1,13 @@
+SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE FROM lineorder, dates WHERE lo_orderdate = d_datekey AND d_year = 1993 AND lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25;
+SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE FROM lineorder, dates WHERE lo_orderdate = d_datekey AND d_yearmonth = 'Jan1994' AND lo_discount BETWEEN 4 AND 6 AND lo_quantity BETWEEN 26 AND 35;
+SELECT SUM(lo_extendedprice * lo_discount) AS REVENUE FROM lineorder, dates WHERE lo_orderdate = d_datekey AND d_weeknuminyear = 6 AND d_year = 1994 AND lo_discount BETWEEN 5 AND 7 AND lo_quantity BETWEEN 26 AND 35;
+SELECT SUM(lo_revenue), d_year, p_brand FROM lineorder, dates, part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND p_category = 'MFGR#12' AND s_region = 'AMERICA' GROUP BY d_year, p_brand ORDER BY p_brand;
+SELECT SUM(lo_revenue), d_year, p_brand FROM lineorder, dates, part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND p_brand BETWEEN 'MFGR#2221' AND 'MFGR#2228' AND s_region = 'ASIA' GROUP BY d_year, p_brand ORDER BY d_year, p_brand;
+SELECT SUM(lo_revenue), d_year, p_brand FROM lineorder, dates, part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND p_brand = 'MFGR#2239' AND s_region = 'EUROPE' GROUP BY d_year, p_brand ORDER BY d_year, p_brand;
+SELECT c_nation, s_nation, d_year, SUM(lo_revenue) AS REVENUE FROM customer, lineorder, supplier, dates WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_orderdate = d_datekey AND c_region = 'ASIA' AND s_region = 'ASIA' AND d_year >= 1992 AND d_year <= 1997 GROUP BY c_nation, s_nation, d_year ORDER BY d_year ASC, REVENUE DESC;
+SELECT c_city, s_city, d_year, SUM(lo_revenue) AS REVENUE FROM customer, lineorder, supplier, dates WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_orderdate = d_datekey AND c_nation = 'UNITED STATES' AND s_nation = 'UNITED STATES' AND d_year >= 1992 AND d_year <= 1997 GROUP BY c_city, s_city, d_year ORDER BY d_year ASC, REVENUE DESC;
+SELECT c_city, s_city, d_year, SUM(lo_revenue) AS REVENUE FROM customer, lineorder, supplier, dates WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_orderdate = d_datekey AND ( c_city = 'UNITED KI1' OR c_city = 'UNITED KI5' ) AND ( s_city = 'UNITED KI1' OR s_city = 'UNITED KI5' ) AND d_year >= 1992 AND d_year <= 1997 GROUP BY c_city, s_city, d_year ORDER BY d_year ASC, REVENUE DESC;
+SELECT c_city, s_city, d_year, SUM(lo_revenue) AS REVENUE FROM customer, lineorder, supplier, dates WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_orderdate = d_datekey AND ( c_city = 'UNITED KI1' OR c_city = 'UNITED KI5' ) AND ( s_city = 'UNITED KI1' OR s_city = 'UNITED KI5' ) AND d_yearmonth = 'Dec1997' GROUP BY c_city, s_city, d_year ORDER BY d_year ASC, REVENUE DESC;
+SELECT d_year, c_nation, SUM(lo_revenue - lo_supplycost) AS PROFIT FROM dates, customer, supplier, part, lineorder WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_partkey = p_partkey AND lo_orderdate = d_datekey AND c_region = 'AMERICA' AND s_region = 'AMERICA' AND ( p_mfgr = 'MFGR#1' OR p_mfgr = 'MFGR#2' ) GROUP BY d_year, c_nation ORDER BY d_year, c_nation;
+SELECT d_year, s_nation, p_category, SUM(lo_revenue - lo_supplycost) AS PROFIT FROM dates, customer, supplier, part, lineorder WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_partkey = p_partkey AND lo_orderdate = d_datekey AND c_region = 'AMERICA' AND s_region = 'AMERICA' AND ( d_year = 1997 OR d_year = 1998 ) AND ( p_mfgr = 'MFGR#1' OR p_mfgr = 'MFGR#2' ) GROUP BY d_year, s_nation, p_category ORDER BY d_year, s_nation, p_category;
+SELECT d_year, s_city, p_brand, SUM(lo_revenue - lo_supplycost) AS PROFIT FROM dates, customer, supplier, part, lineorder WHERE lo_custkey = c_custkey AND lo_suppkey = s_suppkey AND lo_partkey = p_partkey AND lo_orderdate = d_datekey AND s_nation = 'UNITED STATES' AND ( d_year = 1997 OR d_year = 1998 ) AND p_category = 'MFGR#14' GROUP BY d_year, s_city, p_brand ORDER BY d_year, s_city, p_brand;
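For context: `lineorder_flat` keeps `LO_ORDERDATE` as a BIGINT in yyyymmdd form, so the flat queries above derive the year by integer division instead of a date function. A quick illustration against a hypothetical FE endpoint:

```
# 19971215 DIV 10000 drops the month and day digits, leaving the year 1997.
mysql -h127.0.0.1 -P9030 -uroot -e "SELECT 19971215 DIV 10000 AS year"
```
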
diff --git a/tools/emr_storage_regression/standard_set/queries/tpch_queries.sql b/tools/emr_storage_regression/standard_set/queries/tpch_queries.sql
new file mode 100644
index 0000000000..43d46e0b0b
--- /dev/null
+++ b/tools/emr_storage_regression/standard_set/queries/tpch_queries.sql
@@ -0,0 +1,22 @@
+select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where l_shipdate <= date '1998-12-01' - interval '90' day group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;
+select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment from part, supplier, partsupp, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 15 and p_type like '%BRASS' and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE' and ps_supplycost = ( select min(ps_supplycost) from partsupp, supplier, nation, region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regi [...]
+select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = 'BUILDING' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < date '1995-03-15' and l_shipdate > date '1995-03-15' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10;
+select o_orderpriority, count(*) as order_count from orders where o_orderdate >= date '1993-07-01' and o_orderdate < date '1993-07-01' + interval '3' month and exists ( select * from lineitem where l_orderkey = o_orderkey and l_commitdate < l_receiptdate ) group by o_orderpriority order by o_orderpriority;
+select n_name, sum(l_extendedprice * (1 - l_discount)) as revenue from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'ASIA' and o_orderdate >= date '1994-01-01' and o_orderdate < date '1994-01-01' + interval '1' year group by n_name order by revenue desc;
+select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate >= date '1994-01-01' and l_shipdate < date '1994-01-01' + interval '1' year and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24;
+select supp_nation, cust_nation, l_year, sum(volume) as revenue from ( select n1.n_name as supp_nation, n2.n_name as cust_nation, extract(year from l_shipdate) as l_year, l_extendedprice * (1 - l_discount) as volume from supplier, lineitem, orders, customer, nation n1, nation n2 where s_suppkey = l_suppkey and o_orderkey = l_orderkey and c_custkey = o_custkey and s_nationkey = n1.n_nationkey and c_nationkey = n2.n_nationkey and ( (n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY') or (n1.n_ [...]
+select o_year, sum(case when nation = 'BRAZIL' then volume else 0 end) / sum(volume) as mkt_share from ( select extract(year from o_orderdate) as o_year, l_extendedprice * (1 - l_discount) as volume, n2.n_name as nation from part, supplier, lineitem, orders, customer, nation n1, nation n2, region where p_partkey = l_partkey and s_suppkey = l_suppkey and l_orderkey = o_orderkey and o_custkey = c_custkey and c_nationkey = n1.n_nationkey and n1.n_regionkey = r_regionkey and r_name = 'AMERIC [...]
+select nation, o_year, sum(amount) as sum_profit from ( select n_name as nation, extract(year from o_orderdate) as o_year, l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount from part, supplier, lineitem, partsupp, orders, nation where s_suppkey = l_suppkey and ps_suppkey = l_suppkey and ps_partkey = l_partkey and p_partkey = l_partkey and o_orderkey = l_orderkey and s_nationkey = n_nationkey and p_name like '%green%' ) as profit group by nation, o_year order by na [...]
+select c_custkey, c_name, sum(l_extendedprice * (1 - l_discount)) as revenue, c_acctbal, n_name, c_address, c_phone, c_comment from customer, orders, lineitem, nation where c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate >= date '1993-10-01' and o_orderdate < date '1993-10-01' + interval '3' month and l_returnflag = 'R' and c_nationkey = n_nationkey group by c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment order by revenue desc limit 20;
+select ps_partkey, sum(ps_supplycost * ps_availqty) as value from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'GERMANY' group by ps_partkey having sum(ps_supplycost * ps_availqty) > ( select sum(ps_supplycost * ps_availqty) * 0.000002 from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'GERMANY' ) order by value desc;
+select l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1 else 0 end) as high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1 else 0 end) as low_line_count from orders, lineitem where o_orderkey = l_orderkey and l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate and l_shipdate < l_commitdate and l_receiptdate >= date '1994-01-01' and l_receiptdate < date '1994-01-01' + interval '1' year g [...]
+select c_count, count(*) as custdist from ( select c_custkey, count(o_orderkey) as c_count from customer left outer join orders on c_custkey = o_custkey and o_comment not like '%special%requests%' group by c_custkey ) as c_orders group by c_count order by custdist desc, c_count desc;
+select 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount) else 0 end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue from lineitem, part where l_partkey = p_partkey and l_shipdate >= date '1995-09-01' and l_shipdate < date '1995-09-01' + interval '1' month;
+with revenue0 as ( select l_suppkey supplier_no, sum(l_extendedprice * (1 - l_discount)) total_revenue from lineitem where l_shipdate >= date '1996-01-01' and l_shipdate < date '1996-01-01' + interval '3' month group by l_suppkey ) select /*+SET_VAR(enable_nereids_planner=true,enable_pipeline_engine=true) */ s_suppkey, s_name, s_address, s_phone, total_revenue from supplier, revenue0 where s_suppkey = supplier_no and total_revenue = ( select max(total_revenue) from revenue0 ) order by s_ [...]
+select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#45' and p_type not like 'MEDIUM POLISHED%' and p_size in (49, 14, 23, 45, 19, 3, 36, 9) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc, p_brand, p_type, p_size;
+select sum(l_extendedprice) / 7.0 as avg_yearly from lineitem, part where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX' and l_quantity < ( select 0.2 * avg(l_quantity) from lineitem where l_partkey = p_partkey );
+select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity) from customer, orders, lineitem where o_orderkey in ( select l_orderkey from lineitem group by l_orderkey having sum(l_quantity) > 300 ) and c_custkey = o_custkey and o_orderkey = l_orderkey group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice order by o_totalprice desc, o_orderdate limit 100;
+select sum(l_extendedprice* (1 - l_discount)) as revenue from lineitem, part where ( p_partkey = l_partkey and p_brand = 'Brand#12' and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 1 and l_quantity <= 1 + 10 and p_size between 1 and 5 and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON' ) or ( p_partkey = l_partkey and p_brand = 'Brand#23' and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >= 10 and l_qu [...]
+select s_name, s_address from supplier, nation where s_suppkey in ( select ps_suppkey from partsupp where ps_partkey in ( select p_partkey from part where p_name like 'forest%' ) and ps_availqty > ( select 0.5 * sum(l_quantity) from lineitem where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date '1994-01-01' and l_shipdate < date '1994-01-01' + interval '1' year ) ) and s_nationkey = n_nationkey and n_name = 'CANADA' order by s_name;
+select s_name, count(*) as numwait from supplier, lineitem l1, orders, nation where s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus = 'F' and l1.l_receiptdate > l1.l_commitdate and exists ( select * from lineitem l2 where l2.l_orderkey = l1.l_orderkey and l2.l_suppkey <> l1.l_suppkey ) and not exists ( select * from lineitem l3 where l3.l_orderkey = l1.l_orderkey and l3.l_suppkey <> l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate ) and s_nationkey = n_nation [...]
+select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbal from ( select substring(c_phone, 1, 2) as cntrycode, c_acctbal from customer where substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17') and c_acctbal > ( select avg(c_acctbal) from customer where c_acctbal > 0.00 and substring(c_phone, 1, 2) in ('13', '31', '23', '29', '30', '18', '17') ) and not exists ( select * from orders where o_custkey = c_custkey ) ) as custsale group by cntrycode order by cntrycode;
diff --git a/tools/emr_storage_regression/standard_set/run_queries.sh b/tools/emr_storage_regression/standard_set/run_queries.sh
new file mode 100644
index 0000000000..6448c1ae74
--- /dev/null
+++ b/tools/emr_storage_regression/standard_set/run_queries.sh
@@ -0,0 +1,46 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+##############################################################
+# See emr_tools.sh
+##############################################################
+
+set -e
+
+FE_HOST=$1
+USER=$2
+FE_QUERY_PORT=$3
+DB=$4
+
+TRIES=3
+QUERY_NUM=1
+RESULT_FILE=result-master-"${DB}".csv
+touch "${RESULT_FILE}"
+truncate -s 0 "${RESULT_FILE}"
+
+while read -r query; do
+    echo -n "query${QUERY_NUM}," | tee -a "${RESULT_FILE}"
+    for i in $(seq 1 "${TRIES}"); do
+        RES=$(mysql -vvv -h"${FE_HOST}" -u"${USER}" -P"${FE_QUERY_PORT}" -D"${DB}" -e "${query}" | perl -nle 'print $1 if /((\d+\.\d+)+ sec)/' || :)
+        echo -n "${RES}" | tee -a "${RESULT_FILE}"
+        [[ "${i}" != "${TRIES}" ]] && echo -n "," | tee -a "${RESULT_FILE}"
+    done
+    echo "" | tee -a "${RESULT_FILE}"
+
+    QUERY_NUM=$((QUERY_NUM + 1))
+done <"$5"
diff --git a/tools/emr_storage_regression/standard_set/run_standard_set.sh b/tools/emr_storage_regression/standard_set/run_standard_set.sh
new file mode 100644
index 0000000000..b237795129
--- /dev/null
+++ b/tools/emr_storage_regression/standard_set/run_standard_set.sh
@@ -0,0 +1,76 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+##############################################################
+# See emr_tools.sh
+##############################################################
+
+# usage: sh run_standard_set.sh <fe_host> <user> <query_port> <catalog_name> [ssb|ssb_flat|tpch|clickbench|all]
+FE_HOST=$1
+USER=$2
+PORT=$3
+if [[ -z "$4" ]]; then
+    echo 'need catalog name'
+    exit 1
+else
+    catalog_name=$4
+fi
+
+if [[ -z "$5" ]]; then
+    echo "run all test by default"
+    case=all
+elif [[ "$5" = 'all' ]]; then
+    echo "run all test"
+    case=all
+else
+    case=$5
+fi
+
+if [[ -z ${TYPE} ]]; then
+    TYPE=obj
+fi
+echo "execute ${case} benchmark for ${TYPE}..."
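+# Dispatch: each case below runs run_queries.sh twice per suite, once against
+# the parquet database and once against the orc database for the active
+# storage TYPE; run_queries.sh writes the timings to result-master-<db>.csv.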
+ +if [[ "${case}" = 'ssb' ]]; then + # ssb + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".ssb100_parquet_"${TYPE}" queries/ssb_queries.sql + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".ssb100_orc_"${TYPE}" queries/ssb_queries.sql +elif [[ "${case}" = 'ssb_flat' ]]; then + # ssb_flat + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".ssb100_parquet_"${TYPE}" queries/ssb_flat_queries.sql + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".ssb100_orc_"${TYPE}" queries/ssb_flat_queries.sql +elif [[ "${case}" = 'tpch' ]]; then + # tpch + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".tpch100_parquet_"${TYPE}" queries/tpch_queries.sql + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".tpch100_orc_"${TYPE}" queries/tpch_queries.sql +elif [[ "${case}" = 'clickbench' ]]; then + # clickbench + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".clickbench_parquet_"${TYPE}" queries/clickbench_queries.sql + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".clickbench_orc_"${TYPE}" queries/clickbench_queries.sql +else + # run all + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".ssb100_parquet_"${TYPE}" queries/ssb_queries.sql + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".ssb100_orc_"${TYPE}" queries/ssb_queries.sql + + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".ssb100_parquet_"${TYPE}" queries/ssb_flat_queries.sql + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".ssb100_orc_"${TYPE}" queries/ssb_flat_queries.sql + + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".tpch100_parquet_"${TYPE}" queries/tpch_queries.sql + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".tpch100_orc_"${TYPE}" queries/tpch_queries.sql + + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".clickbench_parquet_"${TYPE}" queries/clickbench_queries.sql + sh run_queries.sh "${FE_HOST}" "${USER}" "${PORT}" "${catalog_name}".clickbench_orc_"${TYPE}" queries/clickbench_queries.sql +fi --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org