Hi Robert,

I see, so the join first needs to consume all of its input and process it.
In my case I couldn't wait that long, because the first join quickly
generated more data than could fit in memory or on disk. The solution was
to manually specify a JoinHint that broadcasts the small dataset; that way
the join produces output as soon as it receives data.
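
For the archives, here is a rough sketch of why broadcasting helps. It is
plain Java, not Flink code, and the class and method names are made up for
illustration. The small input is loaded into a hash table once; after that,
each record of the large input is joined and emitted the moment it arrives,
so nothing has to be fully buffered or sorted first. In the DataSet API the
hint is passed as the second argument to join(), for example
large.join(small, JoinHint.BROADCAST_HASH_SECOND), assuming the small
dataset is the second input.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical illustration of a broadcast hash join; not Flink's implementation.
public class BroadcastJoinSketch {

    // Build phase: load the entire small input into an in-memory hash table,
    // keyed by the join key. This is the only side that must be read completely.
    public static Map<Integer, List<String>> buildTable(List<Map.Entry<Integer, String>> small) {
        Map<Integer, List<String>> table = new HashMap<>();
        for (Map.Entry<Integer, String> row : small) {
            table.computeIfAbsent(row.getKey(), k -> new ArrayList<>()).add(row.getValue());
        }
        return table;
    }

    // Probe phase: called once per record of the large input. Matches are
    // handed to the downstream consumer immediately, one record at a time.
    public static void probe(Map<Integer, List<String>> table,
                             int key, String value, Consumer<String> out) {
        for (String match : table.getOrDefault(key, List.of())) {
            out.accept(value + "|" + match);
        }
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> table =
                buildTable(List.of(Map.entry(1, "a"), Map.entry(2, "b")));

        // Records of the large side arrive one by one; the key 3 has no match.
        probe(table, 2, "x", System.out::println);
        probe(table, 3, "y", System.out::println);
        probe(table, 1, "z", System.out::println);
    }
}
```

The trade-off is that the whole small dataset has to fit in the memory of
every parallel instance, which was fine in my case.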

Thanks a lot for the clarification!

Best,
Yassine

2016-11-23 17:35 GMT+01:00 Robert Metzger <rmetz...@apache.org>:

> Hi Yassine,
>
> you don't necessarily need to set the parallelism of the last two
> operators to 31; the sink with parallelism 1 will still fit into the slots.
> A task slot can, by default, hold an entire "slice" or parallel instance
> of a job.
>
> The reason why the sink stays in state CREATED in the beginning is that
> it hasn't received any data yet. In batch mode, an operator switches to
> RUNNING once it has received its first record. In your case, there are some
> operations (reduce, join) before the sink that first need to consume all
> data and process it.
> If you wait long enough, you should see the sink become active.
>
> Regards,
> Robert
>
>
>
> On Wed, Nov 23, 2016 at 2:03 PM, Yassine MARZOUGUI <
> y.marzou...@mindlytix.com> wrote:
>
>> Hi all,
>>
>> My batch job has the following plan in the end (figure attached):
>>
>> I have a total of 32 task slots, and I have set the parallelism of the
>> last two operators before the sink to 31. The sink parallelism is 1. The
>> last operator before the sink is a MapOperator, so it doesn't need to
>> buffer all elements before emitting the output.
>>
>> Looking at the dashboard, I see that one task slot is available, but the
>> sink subtask stays in the state "CREATED".
>>
>> Any explanation for this behaviour? Thank you.
>>
>> Best,
>> Yassine
>>
>
>
