Re: what to consider when testing a data stream application using the TPC-H benchmark data?

Felipe Gutierrez Mon, 22 Jun 2020 06:37:24 -0700

Hi Arvid,

thanks for the references. I didn't find those tests before. I will
definitely consider them to test my application.


The thing is that I am testing a pre-aggregation stream operator that I
have implemented. Particularly I need a high workload to create
backpressure on the shuffle phase, after the keyBy transformation is done.
And I am monitoring the throughput only of this operator. So, I will stick
with the source function but consider what there is on the other references.

I know that the Table API already has a pre-agg [2]. However, mine works a
little bit differently.

[2]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tuning/streaming_aggregation_optimization.html

Thanks,
Felipe
*--*
*-- Felipe Gutierrez*

*-- skype: felipe.o.gutierrez*
*--* *https://felipeogutierrez.blogspot.com
<https://felipeogutierrez.blogspot.com>*


On Mon, Jun 22, 2020 at 2:54 PM Arvid Heise <ar...@ververica.com> wrote:

> Hi Felipe,
>
> The examples are pretty old (6 years), hence they still use DataSet.
>
> You should be fine by mostly replacing sources with file sources (no need
> to write your own source, except you want to generators) and using global
> windows for joining.
>
> However, why not use SQL for TPC-H? We have an e2e test [1], where some
> TPC-H queries are used (in slightly modified form) [2].
> We also have TPC-DS queries as e2e tests [3].
>
> [1]
> https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-tpch-test
> [2]
> https://github.com/apache/flink/tree/master/flink-end-to-end-tests/test-scripts/test-data/tpch/modified-query
> [3]
> https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-tpcds-test
>
> On Mon, Jun 22, 2020 at 12:35 PM Felipe Gutierrez <
> felipe.o.gutier...@gmail.com> wrote:
>
>> Hi all,
>>
>> I would like to create some data stream queries tests using the TPC-H
>> benchmark. I saw that there are some examples of TPC Q3[1] and Q10[2],
>> however, they are using DataSet. If I consider creating these queries
>> but using DataStream what are the caveats that I have to ensure when
>> implementing the source function? I mean, the frequency of emitting
>> items is certainly the first. I suppose that I would change the
>> frequency of the workload globally for all data sources. Is only it or
>> do you have other things to consider?
>>
>> [1]
>> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/relational/TPCHQuery3.java
>> [2]
>> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/relational/TPCHQuery10.java
>>
>> Thanks,
>> Felipe
>> --
>> -- Felipe Gutierrez
>> -- skype: felipe.o.gutierrez
>> -- https://felipeogutierrez.blogspot.com
>>
>
>
> --
>
> Arvid Heise | Senior Java Developer
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> (Toni) Cheng
>

Re: what to consider when testing a data stream application using the TPC-H benchmark data?

Reply via email to