Hi Robert,

thank you for your reply. Yes, I read about it on the mailing list; it is very nice that you now maintain it as part of the Flink project. I might switch to those Dockerfiles.
Thank you for that tip, I will look more into the TPC-* direction.

You are right, I expect some impact on reading and writing files due to the copy-on-write file system, which I might be able to overcome by mounting the host's file system. Since containers use the underlying operating system's drivers, the impact should be minimal to nonexistent compared to virtual machines. I also expect the network to have a significant impact, since I will connect the containers via an overlay network.

Best regards,
Tobias

On 7 May 2015 at 17:25, Robert Metzger <rmetz...@apache.org> wrote:

> Hi Tobias,
>
> sorry for the late reply.
> Did you check out the code for starting Flink on Docker already?
> https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
> Maybe that will save you some time ;)
>
> Benchmarks using TPC-* data are quite popular.
> Maybe this is also helpful for you:
> https://amplab.cs.berkeley.edu/benchmark/
>
> As far as I understood it, you are planning to benchmark Flink on bare
> metal vs. Flink on Docker.
> I suspect the IO penalty when using (para)virtualization is the highest, so
> I would do IO (disk, network) intensive tests. For example, a join on data
> that is much bigger than the available main memory, or a word count with a
> lot of unique words.
>
> I think you'll see differences on the hardware you're planning to rent.
> I don't know any relevant papers for that, but maybe your advisor can help
> you with that.
>
> Best,
> Robert
>
> On Mon, May 4, 2015 at 2:05 PM, Tobias Wiens <tobwi...@gmail.com> wrote:
>
> > Hi,
> >
> > I am writing my master's thesis about how to orchestrate scalable tasks
> > and services with Docker. The idea is to swap a task's environment very
> > fast; I plan to use Flink as the environment. For that I package Apache
> > Flink into containers and start one master and X worker nodes plus one
> > task on them, like a short-term Flink PaaS.
> >
> > I want to measure the performance of one Flink cluster and compare it to
> > a Flink cluster inside containers.
> > That benchmark will be part of a series of benchmarks in which I test
> > disk, network, memory and CPU. So the Flink benchmark can be an applied
> > use case which mixes all sorts of demands.
> >
> > I saw the TPC-H data generator example
> > (https://github.com/rmetzger/scratch/blob/distributed-tpch-generator/src/main/java/flink/generators/programs/TPCHGeneratorExample.java)
> > and I read about the sorting benchmarks on http://sortbenchmark.org/.
> > TPC-H: I am not sure how meaningful it is to join those two tables and
> > whether it is commonly used in publications. I thought the minute sort
> > might be interesting because it sounds small.
> >
> > *Which benchmarks would you recommend?* The benchmarks should be easy to
> > implement.
> >
> > I don't have any infrastructure. I want to rent bare metal servers at
> > SoftLayer
> > (https://www.softlayer.com/Store/orderHourlyBareMetalInstance/37276/64),
> > maybe 5 servers (4 cores, 8 GB RAM). *Does that make sense? Will I even
> > see differences in such a "small" cluster?*
> >
> > Any help or pointers to papers that benchmark data processing platforms
> > are very welcome. Thank you!
> >
> > Best regards,
> > Tobias
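
For the word count with a lot of unique words mentioned in this thread, the input could be generated roughly as follows. This is only a minimal sketch under my own assumptions: the function names (generate_lines, write_corpus, count_words), the word format and all parameters are hypothetical and not part of Flink or of the linked generators. The fixed seed is there so that the bare metal cluster and the container cluster can be fed identical data.

```python
import random
from collections import Counter


def generate_lines(num_lines, words_per_line, vocab_size, seed=42):
    """Yield lines of pseudo-random words drawn from a large vocabulary.

    A large vocab_size relative to the total number of words gives many
    unique keys, which is what makes the word count IO and shuffle heavy.
    """
    rng = random.Random(seed)  # fixed seed: same input for every cluster
    for _ in range(num_lines):
        yield " ".join(
            f"w{rng.randrange(vocab_size):08d}" for _ in range(words_per_line)
        )


def write_corpus(path, num_lines, words_per_line, vocab_size, seed=42):
    """Write the generated lines to a text file; return the line count."""
    written = 0
    with open(path, "w", encoding="utf-8") as out:
        for line in generate_lines(num_lines, words_per_line, vocab_size, seed):
            out.write(line + "\n")
            written += 1
    return written


def count_words(lines):
    """Single-machine reference count, for checking a small sample only."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts
```

The reference count is only meant for sanity-checking a small sample of the job output. For the actual runs the file size would have to be chosen well above the available main memory, as suggested above.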