Hi Hanna, I assume you are asking about the possible speedup of batch analysis programs and not about streaming applications (please correct me if I'm wrong).
Timur raised very good points about data size and skew. Given evenly distributed data (no skewed key distribution for a grouping or join operation) and sufficiently large data sets, Flink scales quite well.

By default, Flink uses pipelined shuffles. Depending on the parallelism, you need to adapt the number of network buffers (see taskmanager.network.numberOfBuffers [1]). When the required number of network buffers becomes too large, you can switch to batched shuffles (set ExecutionMode.BATCH on the ExecutionConfig of the ExecutionEnvironment).

Apart from that, complex programs with a really large parallelism might suffer from the scheduling overhead of the individual tasks. Here the JobManager can become a bottleneck when assigning tasks to workers, so there might be an initial delay before a job starts processing.

If your data is skewed or too small, scaling out doesn't help: either a single worker is busy working while all others wait for it, or the overhead of distributing the work becomes too large.

Hope this helps,
Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.1/setup/config.html#jobmanager-amp-taskmanager

2017-01-03 15:44 GMT+01:00 Timur Shenkao <t...@timshenkao.su>:

> Hi,
> It seems your questions are too abstract & theoretical. The answer is: it depends on several factors. Skewness in the data, data volume, reliability requirements, "fatness" of the servers, whether one performs look-ups in other data sources, etc.
> The papers you mentioned mean the following: under concrete & specific conditions, the researchers achieved their results. If they had changed some parameters slightly (increased the network's throughput, for example, or changed the garbage collector's options), the results would have been completely different.
>
> On Tuesday, January 3, 2017, Hanna Prinz <hanna_pr...@yahoo.de> wrote:
>
>> Happy new year everyone :)
>>
>> I’m currently working on a paper about Flink.
>> I already got some recommendations on general papers with details about Flink, which helped me a lot already. But now that I read them, *I’m further interested in the speedup capabilities provided by the Flink framework: how „far“ can it scale efficiently?*
>>
>> Amdahl’s law states that a parallelization is only efficient as long as the non-parallelizable part of the processing (time for the communication between the nodes etc.) doesn’t „eat up“ the speed gains of parallelization (= parallel slowdown).
>> Of course, the communication overhead is mostly caused by the implementation, but the framework’s specific solution for the communication between the nodes has a reasonable effect as well.
>>
>> After studying these papers, it looks like, although Flink’s performance is better in many cases, the possible speedup is equal to the possible speedup of Spark:
>>
>> 1. Spark versus Flink - Understanding Performance in Big Data Analytics Frameworks | https://hal.inria.fr/hal-01347638/document
>> 2. Big Data Analytics on Cray XC Series DataWarp using Hadoop, Spark and Flink | https://cug.org/proceedings/cug2016_proceedings/includes/files/pap141.pdf
>> 3. Thrill - High-Performance Algorithmic Distributed Batch Data Processing with C++ | https://panthema.net/2016/0816-Thrill-High-Performance-Algorithmic-Distributed-Batch-Data-Processing-with-CPP/1608.05634v1.pdf
>>
>> Does someone have …
>> … more information (or data) on speedup of Flink applications?
>> … experience (or data) with Flink in an extremely parallelized environment?
>> … detailed information on how the nodes communicate, especially when they are waiting for task results of one another?
>>
>> Thank you very much for your time & answers!
>> Hanna
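
PS: To make the two knobs from my reply above concrete, here is a sketch against the Flink 1.1 DataSet API. The buffer count is only an example value (not a recommendation; the right number depends on your parallelism, see [1]):

```java
import org.apache.flink.api.common.ExecutionMode;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchShuffleSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Switch from the default pipelined shuffles to batched shuffles,
        // which need far fewer network buffers at high parallelism:
        env.getConfig().setExecutionMode(ExecutionMode.BATCH);
        // ... define and execute the program on env ...
        // The buffer pool itself is sized in flink-conf.yaml, e.g.:
        //   taskmanager.network.numberOfBuffers: 4096   (example value)
    }
}
```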
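
And as a back-of-the-envelope check on the Amdahl's law point: with an assumed serial fraction of 5% (a purely illustrative number, not a measured Flink value), the achievable speedup is capped at 1/0.05 = 20x no matter how many workers you add:

```java
public class AmdahlDemo {
    // Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n),
    // where s is the serial (non-parallelizable) fraction of the work.
    static double speedup(double serialFraction, int workers) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
    }

    public static void main(String[] args) {
        double s = 0.05; // assumed 5% serial part (scheduling, coordination)
        for (int n : new int[] {1, 8, 64, 512}) {
            System.out.printf("n=%3d  speedup=%.2f%n", n, speedup(s, n));
        }
        // As n grows, speedup approaches 1/s = 20 and adding workers only
        // pays for their coordination overhead -- the parallel slowdown
        // Hanna describes.
    }
}
```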