Hi Tobias,

Sorry for the late reply.
Did you already check out the code for starting Flink on Docker?
https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
Maybe that will save you some time ;)

Benchmarks using TPC-* data are quite popular.
Maybe this is also helpful for you:
https://amplab.cs.berkeley.edu/benchmark/

As far as I understood it, you are planning to benchmark Flink on bare
metal vs. Flink on Docker.
I suspect the IO penalty is the largest overhead when using
(para)virtualization, so I would run IO-intensive (disk, network) tests.
For example, a join on data that is much bigger than the available main
memory, or a word count with a lot of unique words.
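
A minimal sketch of how the input for such a word count could be generated.
This is only an illustration: the function names, the token format, and the
default values are my own assumptions and are not part of Flink or of any
existing benchmark.

```python
import random


def generate_words(num_words, seed=42):
    """Yield num_words pseudo-random tokens that are almost all distinct."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    for _ in range(num_words):
        # 64 random bits rendered as 16 hex characters; with this many
        # possible values, repeated tokens are very unlikely.
        yield format(rng.getrandbits(64), "016x")


def write_words(out, num_words, words_per_line=10, seed=42):
    """Write the tokens to a text stream, words_per_line tokens per line."""
    line = []
    for word in generate_words(num_words, seed):
        line.append(word)
        if len(line) == words_per_line:
            out.write(" ".join(line) + "\n")
            line = []
    if line:  # flush the last, possibly shorter, line
        out.write(" ".join(line) + "\n")


if __name__ == "__main__":
    # Hypothetical output file name; increase num_words for a real run.
    with open("wordcount-input.txt", "w") as f:
        write_words(f, num_words=1_000_000)
```

Each token takes 17 bytes including its separator, so num_words has to be
scaled up until the file is clearly larger than the combined main memory of
the cluster. For large inputs a parallel generator would be faster than this
single-threaded script.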

I think you'll see differences on the hardware you're planning to rent.
I don't know any relevant papers for that, but maybe your advisor can help
you with that.

Best,
Robert



On Mon, May 4, 2015 at 2:05 PM, Tobias Wiens <tobwi...@gmail.com> wrote:

> Hi,
>
> I am writing my master's thesis about how to orchestrate scalable tasks and
> services with Docker. The idea is to swap a task's environment very fast; I
> plan to use Flink as the environment. For that I package Apache Flink into
> a container and start one master and X worker nodes plus one task on them,
> like a short-term Flink PaaS.
>
> I want to measure the performance of one Flink cluster and compare it to a
> Flink cluster inside containers.
> That benchmark will be part of a series of benchmarks in which I test disk,
> network, memory and CPU. So the Flink benchmark can be an applied use-case
> which mixes all sorts of demands.
>
> I saw the TPC-H data generator example (
>
> https://github.com/rmetzger/scratch/blob/distributed-tpch-generator/src/main/java/flink/generators/programs/TPCHGeneratorExample.java
> )
> and I read about the sorting benchmarks on (http://sortbenchmark.org/).
> TPC-H: I am not sure how meaningful it is to join those two tables and if
> it is commonly used in publications. I thought the minute sort might be
> interesting because it sounds small.
>
> *Which benchmarks would you recommend?* The benchmarks should be easy to
> implement.
>
> I don't have any infrastructure, I want to rent bare metal servers at
> softlayer (
> https://www.softlayer.com/Store/orderHourlyBareMetalInstance/37276/64)
> maybe 5 servers (4 cores, 8GB ram).* Does that make sense? Will I even see
> differences in such a "small" cluster? *
>
> Any help or pointers to papers that benchmark data processing
> platforms are very welcome. Thank you!
>
> Best regards,
> Tobias
>
