Hi,

I am writing my master's thesis on how to orchestrate scalable tasks and services with Docker. The idea is to swap a task's environment very quickly, and I plan to use Flink as the environment. For that, I package Apache Flink into a container and start one master and X worker nodes, plus one task that runs on them. Think of it as a short-lived Flink PaaS.
I want to measure the performance of a plain Flink cluster and compare it to a Flink cluster running inside containers. That benchmark will be part of a series of benchmarks in which I test disk, network, memory and CPU, so the Flink benchmark can serve as an applied use case that mixes all of these demands.

I saw the TPC-H data generator example (https://github.com/rmetzger/scratch/blob/distributed-tpch-generator/src/main/java/flink/generators/programs/TPCHGeneratorExample.java) and I read about the sorting benchmarks at http://sortbenchmark.org/.

TPC-H: I am not sure how meaningful it is to join those two tables, or whether this is commonly used in publications. I thought MinuteSort might be interesting because it sounds small.

*Which benchmarks would you recommend?* The benchmarks should be easy to implement.

I don't have any infrastructure of my own. I want to rent bare metal servers at SoftLayer (https://www.softlayer.com/Store/orderHourlyBareMetalInstance/37276/64), maybe 5 servers (4 cores, 8 GB RAM each).

*Does that make sense? Will I even see differences in such a "small" cluster?*

Any help, or pointers to papers that benchmark data processing platforms, would be very welcome.

Thank you!

Best regards,
Tobias