Hi Arvid, thanks for the references. I didn't find those tests before. I will definitely consider them to test my application.
The thing is that I am testing a pre-aggregation stream operator that I have implemented. Particularly I need a high workload to create backpressure on the shuffle phase, after the keyBy transformation is done. And I am monitoring the throughput only of this operator. So, I will stick with the source function but consider what there is on the other references. I know that the Table API already has a pre-agg [2]. However, mine works a little bit differently. [2] https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tuning/streaming_aggregation_optimization.html Thanks, Felipe *--* *-- Felipe Gutierrez* *-- skype: felipe.o.gutierrez* *--* *https://felipeogutierrez.blogspot.com <https://felipeogutierrez.blogspot.com>* On Mon, Jun 22, 2020 at 2:54 PM Arvid Heise <ar...@ververica.com> wrote: > Hi Felipe, > > The examples are pretty old (6 years), hence they still use DataSet. > > You should be fine by mostly replacing sources with file sources (no need > to write your own source, except you want to generators) and using global > windows for joining. > > However, why not use SQL for TPC-H? We have an e2e test [1], where some > TPC-H queries are used (in slightly modified form) [2]. > We also have TPC-DS queries as e2e tests [3]. > > [1] > https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-tpch-test > [2] > https://github.com/apache/flink/tree/master/flink-end-to-end-tests/test-scripts/test-data/tpch/modified-query > [3] > https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-tpcds-test > > On Mon, Jun 22, 2020 at 12:35 PM Felipe Gutierrez < > felipe.o.gutier...@gmail.com> wrote: > >> Hi all, >> >> I would like to create some data stream queries tests using the TPC-H >> benchmark. I saw that there are some examples of TPC Q3[1] and Q10[2], >> however, they are using DataSet. If I consider creating these queries >> but using DataStream what are the caveats that I have to ensure when >> implementing the source function? I mean, the frequency of emitting >> items is certainly the first. I suppose that I would change the >> frequency of the workload globally for all data sources. Is only it or >> do you have other things to consider? >> >> [1] >> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/relational/TPCHQuery3.java >> [2] >> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java/relational/TPCHQuery10.java >> >> Thanks, >> Felipe >> -- >> -- Felipe Gutierrez >> -- skype: felipe.o.gutierrez >> -- https://felipeogutierrez.blogspot.com >> > > > -- > > Arvid Heise | Senior Java Developer > > <https://www.ververica.com/> > > Follow us @VervericaData > > -- > > Join Flink Forward <https://flink-forward.org/> - The Apache Flink > Conference > > Stream Processing | Event Driven | Real Time > > -- > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany > > -- > Ververica GmbH > Registered at Amtsgericht Charlottenburg: HRB 158244 B > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji > (Toni) Cheng >