Hi Robert, I see, so the join needs to consume all data first and process it. In my case, I couldn't wait long because the first join quickly generated a lot of data that can't fit in the memory or in the disk. The solution was then to manually specify a JoinHint and broadcast the small dataset, that way I was able to get an output of the join as soon as data is received.
Thanks a lot for the clarification! Best, Yassine 2016-11-23 17:35 GMT+01:00 Robert Metzger <rmetz...@apache.org>: > Hi Yassine, > > you don't necessarily need to set the parallelism of the last two > operators of 31, the sink with parallelism 1 will fit still into the slots. > A task slot can, by default, hold an entire "slice" or parallel instance > of a job. > > The reason why the sink stays in state CREATE in the beginning is because > it didn't receive any data yet. In batch mode, an operator is switching to > RUNNING once it received the first record. In your case, there are some > operations (reduce, join) before the sink that first need to consume all > data and process it. > If you wait long enough, you should see the sink to become active. > > Regards, > Robert > > > > On Wed, Nov 23, 2016 at 2:03 PM, Yassine MARZOUGUI < > y.marzou...@mindlytix.com> wrote: > >> Hi all, >> >> My batch job has the follwoing plan in the end (figure attached): >> >> >> >> I have a total of 32 task slots, and I have set the parallelism of the >> last two operators before the sink to 31. The sink parallelism is 1. The >> last operator before the sink is a MapOperator, so it doesn't need to >> buffer all elements before emitting the output. >> >> Looking at the dashboard, I see that one task slot is available, but the >> sink subtask stays in the state "CREATED". >> >> Any explanation for this behaviour? Thank you. >> >> Best, >> Yassine >> > >