Hey Fabian and Timur,

Thank you for your helpful answers.
Especially since I'm aware there is no simple answer to this. Also, I've only 
just started working with Flink, so I may not have understood everything 
yet :)

From the documentation on task scheduling, I assumed that the JobManager might 
be a bottleneck under extreme parallelization. As for the skewness of the 
data: I suppose that's a problem every large-scale data processing framework 
has, and there is not much to do about it besides improving the partitioning 
where possible.

I will look into the configuration you mentioned, @Fabian, and might get back 
to you with further questions later.

Cheers
Hanna

> On 05.01.2017, at 11:32, Fabian Hueske <fhue...@gmail.com> wrote:
> 
> Hi Hanna,
> 
> I assume you are asking about the possible speed up of batch analysis 
> programs and not about streaming applications (please correct me if I'm 
> wrong).
> 
> Timur raised very good points about data size and skew.
> 
> Given evenly distributed data (no skewed key distribution for a grouping or 
> join operation) and sufficiently large data sets, Flink scales quite well.
> By default, Flink uses pipelined shuffles. Depending on the parallelism, you 
> need to adapt the number of network buffers (see 
> taskmanager.network.numberOfBuffers [1]). When the required number of network 
> buffers becomes too large, you can switch to batched shuffles (set 
> ExecutionMode.BATCH on the ExecutionConfig of the ExecutionEnvironment).
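As a concrete illustration of the switch Fabian describes, a minimal sketch against the Flink 1.1-era DataSet API (class and method names assumed from that version; they may differ in later releases):

```java
import org.apache.flink.api.common.ExecutionMode;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchShuffleSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Switch from the default pipelined shuffles to batched (blocking)
        // shuffles, which reduces the number of network buffers required
        // at very high parallelism.
        env.getConfig().setExecutionMode(ExecutionMode.BATCH);
        // ... define and execute the batch program as usual ...
    }
}
```

The buffer count itself is a TaskManager setting in flink-conf.yaml, e.g. `taskmanager.network.numberOfBuffers: 16384` (the appropriate value depends on the parallelism; see [1]).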
> Apart from that, complex programs with really large parallelism might suffer 
> from the overhead of scheduling the individual tasks. Here the JobManager can 
> become a bottleneck when assigning tasks to workers, so there may be an 
> initial delay before a job starts processing.
> 
> If your data is skewed or too small, scaling out doesn't help: either a 
> single worker is busy working while all the others wait for it, or the 
> overhead of distributing the work becomes too large.
> 
> Hope this helps,
> Fabian
> 
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.1/setup/config.html#jobmanager-amp-taskmanager
> 
> 
> 2017-01-03 15:44 GMT+01:00 Timur Shenkao <t...@timshenkao.su>:
>> Hi, 
>> It seems your questions are rather abstract and theoretical. The answer is: 
>> it depends on several factors: skewness in the data, data volume, reliability 
>> requirements, the "fatness" of the servers, whether one performs look-ups in 
>> other data sources, etc.
>> The papers you mentioned mean the following: under concrete, specific 
>> conditions, the researchers achieved their results. If they had changed some 
>> parameters slightly (increased the network's throughput, for example, or 
>> changed the garbage collector's options), the results would have been 
>> completely different.
>> 
>> 
>>> On Tuesday, January 3, 2017, Hanna Prinz <hanna_pr...@yahoo.de> wrote:
>>> Happy new year everyone :)
>>> 
>>> I'm currently working on a paper about Flink. I already got some 
>>> recommendations on general papers with details about Flink, which have 
>>> helped me a lot. But now that I've read them, I'm further interested in the 
>>> speedup capabilities of the Flink framework: how "far" can it scale 
>>> efficiently?
>>> 
>>> Amdahl's law states that parallelization is only efficient as long as the 
>>> non-parallelizable part of the processing (time for communication between 
>>> the nodes, etc.) doesn't "eat up" the speed gains of parallelization 
>>> (= parallel slowdown).
>>> Of course, the communication overhead is mostly caused by the 
>>> implementation, but the framework's specific solution for communication 
>>> between the nodes has a considerable effect as well.
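To make that limit concrete, a quick back-of-the-envelope calculation of Amdahl's law in plain Java (the 95% parallelizable fraction is purely illustrative):

```java
public class AmdahlSketch {
    // Amdahl's law: S(n) = 1 / ((1 - p) + p / n),
    // where p is the parallelizable fraction and n the number of workers.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double p = 0.95; // assume 95% of the job parallelizes
        for (int n : new int[]{1, 10, 100, 1000}) {
            System.out.printf("n=%4d -> speedup %.2f%n", n, speedup(p, n));
        }
        // The speedup can never exceed 1 / (1 - p) = 20, no matter how
        // many workers are added: the serial 5% eventually dominates.
    }
}
```

Even a 5% serial fraction caps the achievable speedup at 20x, which is why scheduling and shuffle overhead matter so much at extreme parallelism.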
>>> 
>>> After studying these papers, it looks like, although Flink's performance is 
>>> better in many cases, its possible speedup is equal to the possible speedup 
>>> of Spark.
>>> 1. Spark versus Flink - Understanding Performance in Big Data Analytics 
>>> Frameworks | https://hal.inria.fr/hal-01347638/document
>>> 2. Big Data Analytics on Cray XC Series DataWarp using Hadoop, Spark and 
>>> Flink | 
>>> https://cug.org/proceedings/cug2016_proceedings/includes/files/pap141.pdf
>>> 3. Thrill - High-Performance Algorithmic Distributed Batch Data Processing 
>>> with C++ | 
>>> https://panthema.net/2016/0816-Thrill-High-Performance-Algorithmic-Distributed-Batch-Data-Processing-with-CPP/1608.05634v1.pdf
>>> 
>>> Does someone have …
>>> … more information (or data) on speedup of Flink applications? 
>>> … experience (or data) with Flink in an extremely parallelized environment?
>>> … detailed information on how the nodes communicate, especially when they 
>>> are waiting for task results of one another?
>>> 
>>> Thank you very much for your time & answers!
>>> Hanna
> 
