If the loop count of 3 is fixed (or not significantly larger), union should
be fine.
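
To make the "fixed loop count" point concrete, here is a rough sketch. It is a toy model in plain Java, not the Flink API: `LazySet`, `run`, the sample tuples, and the `map` step standing in for the join are all hypothetical names invented for illustration. It only counts how many union operators the loop adds to the lazily built plan, which is one per iteration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;
import java.util.function.Supplier;

public class UnionLoopSketch {

    /** Toy stand-in for a lazy data set: it stores a recipe, not the data. */
    static final class LazySet {
        final Supplier<List<int[]>> recipe; // evaluated only when collect() is called
        final int unions;                   // union operators in the plan so far

        LazySet(Supplier<List<int[]>> recipe, int unions) {
            this.recipe = recipe;
            this.unions = unions;
        }

        static LazySet of(int[]... tuples) {
            List<int[]> data = Arrays.asList(tuples);
            return new LazySet(() -> data, 0);
        }

        LazySet map(UnaryOperator<int[]> f) {
            return new LazySet(() -> {
                List<int[]> out = new ArrayList<>();
                for (int[] t : recipe.get()) out.add(f.apply(t));
                return out;
            }, unions);
        }

        LazySet filter(Predicate<int[]> p) {
            return new LazySet(() -> {
                List<int[]> out = new ArrayList<>();
                for (int[] t : recipe.get()) if (p.test(t)) out.add(t);
                return out;
            }, unions);
        }

        LazySet union(LazySet other) {
            // Only extends the plan; no data is touched here.
            return new LazySet(() -> {
                List<int[]> out = new ArrayList<>(recipe.get());
                out.addAll(other.recipe.get());
                return out;
            }, unions + other.unions + 1);
        }

        List<int[]> collect() {
            return recipe.get();
        }
    }

    /** Mirrors the loop from the thread; each tuple is {f0, f1}. */
    static LazySet run(int iterations) {
        LazySet x = LazySet.of(
                new int[]{1, 1}, new int[]{2, 2}, new int[]{3, 3}, new int[]{4, 9});
        LazySet accumulated = LazySet.of();
        for (int loopCnt = 0; loopCnt < iterations; loopCnt++) {
            // Assumption: decrementing f1 stands in for the real join step.
            x = x.map(t -> new int[]{t[0], t[1] - 1});
            accumulated = accumulated.union(x.filter(t -> t[1] == 0));
            x = x.filter(t -> t[1] != 0);
        }
        return accumulated;
    }

    public static void main(String[] args) {
        LazySet acc = run(3);
        System.out.println("unions in plan: " + acc.unions);
        System.out.println("tuples collected: " + acc.collect().size());
    }
}
```

With 3 iterations the toy plan holds 3 union nodes. The count grows linearly with the loop bound, which is why a much larger bound makes the plan unwieldy.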

2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> Sorry the program has a union at
> accumulated = accumulated.union(x.filter(t.f1 == 0))
>
> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Flavio,
>>
>> your example does not contain a union.
>>
>> Union itself basically comes for free. However, if you have a lot of
>> small DataSets that you want to union, the plan can become very complex and
>> might cause overhead due to scheduling many small tasks. For example, it is
>> usually better to have one data source and input format that reads multiple
>> small files instead of adding one data source for each tiny file and applying
>> union to all data sources to get all data.
>>
>> TL;DR: if your iteration count is only 3, as your example suggests, you
>> should be fine. If it exceeds, say, 32, it might be worth rethinking your
>> program.
>>
>> Cheers, Fabian
>>
>>
>>
>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> Hi Stephan,
>>> thanks for the answer. Unfortunately, I didn't understand whether there's
>>> an alternative to union right now.
>>> My process is basically like this:
>>>
>>> DataSet x = ...
>>> while (loopCnt < 3) {
>>>    x = x.join(y).where(0).equalTo(0).with(...);
>>>    accumulated = x.filter(t -> t.f1 == 0);
>>>    x = x.filter(t -> t.f1 != 0);
>>>    loopCnt++;
>>> }
>>>
>>> Best,
>>> Flavio
>>>
>>>
>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> Union, like all operators, is lazy. When you call union, it only builds
>>>> a "union stream" that performs the union when you execute the task. So
>>>> nothing is added before you call "env.execute()".
>>>>
>>>> After you call "env.execute()" and then union again, you will
>>>> re-execute the entire history of computation to compute the data set that
>>>> you union with. Hence, for incremental computations, union() is probably
>>>> not a good choice, unless you persist intermediate data (seamless support
>>>> for that is WIP).
>>>>
>>>> Stephan
>>>>
>>>>
>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <
>>>> pomperma...@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>> I have a job where I have to incrementally add Tuples to a dataset (in
>>>>> a while loop).
>>>>> Is union() the best operator for this, or is there a more
>>>>> performant one?
>>>>> Does union affect the read of already existing elements, or does it just
>>>>> append the new ones somewhere?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>
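
For reference, Stephan's point about re-executing the whole history can be sketched with another toy model in plain Java. It does not use Flink: `source`, `union`, `demo`, and the `sourceReads` counter are hypothetical names, and calling `get()` on a supplier stands in for `env.execute()`. The sketch counts how often the sources are re-read when the plan is executed after every union, with and without keeping the intermediate result around.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class LazyReexecutionSketch {

    // Counts how many times any source is actually read.
    static int sourceReads = 0;

    /** A lazy "source" producing three integers starting at the given value. */
    static Supplier<List<Integer>> source(int start) {
        return () -> {
            sourceReads++;
            List<Integer> out = new ArrayList<>();
            for (int i = start; i < start + 3; i++) out.add(i);
            return out;
        };
    }

    /** Lazy union: builds a bigger recipe, evaluates nothing yet. */
    static Supplier<List<Integer>> union(Supplier<List<Integer>> a,
                                         Supplier<List<Integer>> b) {
        return () -> {
            List<Integer> out = new ArrayList<>(a.get());
            out.addAll(b.get());
            return out;
        };
    }

    /** Unions and "executes" three times; returns the total source reads. */
    static int demo(boolean persist) {
        sourceReads = 0;
        Supplier<List<Integer>> accumulated = source(0);
        for (int round = 1; round <= 3; round++) {
            accumulated = union(accumulated, source(round * 10));
            // Stand-in for env.execute(): evaluates the whole recipe.
            List<Integer> result = accumulated.get();
            if (persist) {
                // Assumption: keeping the result replaces the history behind it.
                accumulated = () -> result;
            }
        }
        return sourceReads;
    }

    public static void main(String[] args) {
        System.out.println("source reads without persisting: " + demo(false));
        System.out.println("source reads with persisting:    " + demo(true));
    }
}
```

Without persisting, each execution replays every earlier source (2 + 3 + 4 = 9 reads in this toy). With the intermediate result kept, only the newly added source is read in later rounds (2 + 1 + 1 = 4 reads).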
