[ 
https://issues.apache.org/jira/browse/FLINK-15498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009520#comment-17009520
 ] 

Jingsong Lee commented on FLINK-15498:
--------------------------------------

Hi [~ykt836]

This ticket aims to make TPC-DS testing more productive.

Users who try Flink batch often want to reproduce the benchmark scores of 
Flink batch. To do so, they face the following difficulties:
 # Generate the TPC-DS data.
 # Prepare the tables: create the tables, convert the CSV data to ORC format, 
and analyze the tables.
 # Execute the select queries in Flink batch.
 # Execute the select queries in Hive/Tez/Spark/Presto.

For #1, we can offer only limited help; users are more likely to consult 
the official TPC-DS documentation.

But for #2, we can provide the Hive preparation step, so that users can easily 
reproduce it in their own clusters with our e2e code. These tables can also be 
read from other systems. As far as I know, this step is troublesome: it 
involves creating Hive tables with the right nullability, PKs, ORC compression 
and column types, loading the original TPC-DS data into the ORC tables, and 
analyzing the tables.
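
To make the preparation step concrete, here is a minimal sketch of the HiveQL 
it could boil down to for a single table. It is not the actual e2e code: the 
helper name, the `_raw` suffix, the location path, the SNAPPY default and the 
shortened column list are all assumptions for illustration, and nullability 
and PK constraints are left out.

```python
def hive_prepare_statements(table, columns, raw_location, compression="SNAPPY"):
    """Build the HiveQL statements that load one TPC-DS table into ORC.

    Hypothetical helper: `columns` is a list of (name, hive_type) pairs and
    `raw_location` is the directory holding the generated data files.
    """
    col_defs = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    raw_table = f"{table}_raw"  # assumed naming scheme for the staging table
    return [
        # 1. External table over the pipe-delimited files from the data generator.
        f"CREATE EXTERNAL TABLE {raw_table} (\n  {col_defs}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
        f"STORED AS TEXTFILE LOCATION '{raw_location}'",
        # 2. Target table stored as ORC with the chosen compression codec.
        f"CREATE TABLE {table} (\n  {col_defs}\n)\n"
        f"STORED AS ORC TBLPROPERTIES ('orc.compress'='{compression}')",
        # 3. Convert the raw text data to ORC.
        f"INSERT OVERWRITE TABLE {table} SELECT * FROM {raw_table}",
        # 4. Collect table-level and column-level statistics.
        f"ANALYZE TABLE {table} COMPUTE STATISTICS",
        f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS",
    ]


if __name__ == "__main__":
    # Only two columns of date_dim are listed here; the types are illustrative.
    stmts = hive_prepare_statements(
        "date_dim",
        [("d_date_sk", "BIGINT"), ("d_date_id", "STRING")],
        "/tmp/tpcds/date_dim",  # hypothetical path
    )
    print(";\n\n".join(stmts) + ";")
```

Changing the data scale would then only affect the generated files under the 
raw location, not these statements.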

And for #3, the e2e test should be exactly the same as the benchmark.

> Using HiveCatalog in TPC-DS e2e
> -------------------------------
>
>                 Key: FLINK-15498
>                 URL: https://issues.apache.org/jira/browse/FLINK-15498
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Planner, Tests
>            Reporter: Jingsong Lee
>            Priority: Major
>             Fix For: 1.11.0
>
>
> In 1.10, we have made great progress in the performance and functionality 
> of batch. In our internal tests, the performance is significantly ahead of 
> Hive.
> But it's hard for users to reproduce. They need to do some research on 
> TPC-DS to write test code.
> We can consider changing the E2E test of TPC-DS to use HiveCatalog, which 
> is roughly divided into two stages:
>  # The first stage is the preparation of Hive. Prepare the tables of TPC-DS. 
> Insert the data and prepare the metastore. And analyze the tables.
>  # The second stage is the analysis of Flink. Only select and check results.
> Users can play with it just by changing the data scale of the first stage.



