Re: Tasks, slots, and partitioned joins

2017-10-26 Thread Fabian Hueske
Hi David, please find my answers below: 1. For high utilization, all slot should be filled. Each slot will processes a slice of the program on a slice of the data. In case of partitioning or changed parallelism, the data is shuffled accordingly . 2. That's a good question. I think the default log

Re: Tasks, slots, and partitioned joins

2017-10-26 Thread David Dreyfus
Hi Fabian, Thank you for the great, detailed answers. 1. So, each parallel slice of the DAG is placed into one slot. The key to high utilization is many slices of the source data (or the various methods of repartitioning it). Yes? 2. In batch processing, are slots filled round-robin on task manag

Re: Tasks, slots, and partitioned joins

2017-10-26 Thread Fabian Hueske
Hi David, Flink's DataSet API schedules one slice of a program to a task slot. A program slice is one parallel instance of each operator of a program. When all operator of your program run with a parallelism of 1, you end up with only 1 slice that runs on a single slot. Flink's DataSet API leverag

Tasks, slots, and partitioned joins

2017-10-25 Thread David Dreyfus
Hello - I have a large number of pairs of files. For purpose of discussion: /source1/{1..1} and /source2/{1..1}. I want to join the files pair-wise: /source1/1 joined to /source2/1, /source1/2 joined to /source2/2, and so on. I then want to union the results of the pair-wise joins and per