Hi Tobias,

sorry for the late reply. Have you already checked out the code for starting Flink on Docker?
https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
Maybe that will save you some time ;)
Benchmarks using TPC-* data are quite popular. Maybe this is also helpful for you: https://amplab.cs.berkeley.edu/benchmark/

As far as I understand, you are planning to benchmark Flink on bare metal vs. Flink on Docker. I suspect the IO penalty is the largest overhead when using (para)virtualization, so I would run IO-intensive (disk, network) tests. For example, a join over data that is much bigger than the available main memory, or a word count with a lot of unique words. I think you'll see differences on the hardware you're planning to rent.

I don't know of any relevant papers for that, but maybe your advisor can help you with that.

Best,
Robert

On Mon, May 4, 2015 at 2:05 PM, Tobias Wiens <tobwi...@gmail.com> wrote:

> Hi,
>
> I am writing my master thesis about how to orchestrate scalable tasks and
> services with docker. The idea is to swap a task's environment very fast, I
> plan to use Flink as the environment. For that I package Apache Flink into
> container and start one master and X worker nodes plus one task on them.
> Like a short term Flink PaaS.
>
> I want to measure the performance of one Flink cluster and compare it to a
> Flink cluster inside containers.
> That benchmark will be part of a series of benchmarks in which I test disk,
> network, memory and CPU. So the Flink benchmark can be an applied use-case
> which mixes all sorts of demands.
>
> I saw the TPC-H data generator example (
> https://github.com/rmetzger/scratch/blob/distributed-tpch-generator/src/main/java/flink/generators/programs/TPCHGeneratorExample.java
> )
> and I read about the sorting benchmarks on (http://sortbenchmark.org/).
> TPC-H: I am not sure how meaningful it is to join those two tables and if
> it is commonly used in publications. I thought the minute sort might be
> interesting because it sounds small.
>
> *Which benchmarks would you recommend?* The benchmarks should be easy to
> implement.
>
> I don't have any infrastructure, I want to rent bare metal servers at
> softlayer (
> https://www.softlayer.com/Store/orderHourlyBareMetalInstance/37276/64)
> maybe 5 servers (4 cores, 8GB ram).* Does that make sense? Will I even see
> differences in such a "small" cluster? *
>
> Any help or pointing for papers which do benchmark of data processing
> platforms very welcome. Thank you!
>
> Best regards,
> Tobias