Accumulators can be used to collect records, but they are not designed to
hold large amounts of data.
It might work up to a certain point (~10MB) and fail beyond that.

How many unions do you plan to use in your program?
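
For reference, here is a minimal sketch of the loop discussed further down in this thread, with the union included. It is modelled with plain Java lists so that it runs on its own. It does not use Flink's DataSet API, and it is eager, whereas Flink's union is lazy. The names UnionLoopSketch, Rec, step and accumulate are hypothetical, and step only stands in for the join, whose logic is not shown in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnionLoopSketch {

    /** Hypothetical record: f0 is a key, f1 a counter; f1 == 0 means "finished". */
    public static final class Rec {
        public final String f0;
        public final int f1;

        public Rec(String f0, int f1) {
            this.f0 = f0;
            this.f1 = f1;
        }

        @Override
        public String toString() {
            return "(" + f0 + "," + f1 + ")";
        }
    }

    /** Hypothetical stand-in for the join step: lowers each counter by one. */
    static List<Rec> step(List<Rec> x) {
        List<Rec> out = new ArrayList<>();
        for (Rec r : x) {
            out.add(new Rec(r.f0, r.f1 - 1));
        }
        return out;
    }

    /** Runs the loop a fixed number of times and returns the finished records. */
    public static List<Rec> accumulate(List<Rec> input, int maxLoops) {
        List<Rec> x = input;
        List<Rec> accumulated = new ArrayList<>();
        for (int loopCnt = 0; loopCnt < maxLoops; loopCnt++) {
            x = step(x);
            List<Rec> pending = new ArrayList<>();
            for (Rec r : x) {
                if (r.f1 == 0) {
                    // plays the role of accumulated.union(x.filter(t.f1 == 0))
                    accumulated.add(r);
                } else {
                    // plays the role of x = x.filter(t.f1 != 0)
                    pending.add(r);
                }
            }
            x = pending;
        }
        return accumulated;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(new Rec("a", 1), new Rec("b", 2), new Rec("c", 5));
        System.out.println(accumulate(input, 3)); // prints [(a,0), (b,0)]
    }
}
```

With a fixed loop count of 3 the sketch performs three "union" steps, which corresponds to the small case in the thread. Record "c" is still pending after the third pass and is therefore not part of the result.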



2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> OK, thanks. Are there any alternatives to that? May I use accumulators for
> that?
> On 7 Sep 2015 17:47, "Fabian Hueske" <fhue...@gmail.com> wrote:
>
>> If the loop count of 3 is fixed (or not significantly larger), union
>> should be fine.
>>
>> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> Sorry, the program has a union at
>>> accumulated = accumulated.union(x.filter(t.f1 == 0))
>>>
>>> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>>> Hi Flavio,
>>>>
>>>> your example does not contain a union.
>>>>
>>>> Union itself basically comes for free. However, if you have a lot of
>>>> small DataSets that you want to union, the plan can become very complex and
>>>> might cause overhead due to scheduling many small tasks. For example, it is
>>>> usually better to have one data source and input format that reads multiple
>>>> small files instead of adding one data source for each tiny file and
>>>> applying union to all data sources to get all data.
>>>>
>>>> TL;DR: if your iteration count is only 3, as your example suggests, you
>>>> should be fine. If it exceeds, say, 32, it might be worth rethinking your
>>>> program.
>>>>
>>>> Cheers, Fabian
>>>>
>>>>
>>>>
>>>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>
>>>>> Hi Stephan,
>>>>> thanks for the answer. Unfortunately I didn't understand whether there's
>>>>> an alternative to union right now.
>>>>> My process is basically like this:
>>>>>
>>>>> DataSet x = ...
>>>>> while(loopCnt < 3){
>>>>>    x = x.join(y).where(0).equalTo(0).with(...);
>>>>>    accumulated = x.filter(t.f1 == 0);
>>>>>    x =  x.filter(t.f1!=0);
>>>>>    loopCnt++;
>>>>> }
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>>
>>>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>
>>>>>> Union, like all operators, is lazy. When you call union, it only
>>>>>> builds a "union stream" that unions when you execute the task. So nothing
>>>>>> is added before you call "env.execute()".
>>>>>>
>>>>>> After you call "env.execute()" and then union again, you will
>>>>>> re-execute the entire history of computation to compute the data set that
>>>>>> you union with. Hence, for incremental computations, union() is probably
>>>>>> not a good choice, unless you persist intermediate data (seamless support
>>>>>> for that is WIP).
>>>>>>
>>>>>> Stephan
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <
>>>>>> pomperma...@okkam.it> wrote:
>>>>>>
>>>>>>> Hi to all,
>>>>>>> I have a job where I have to incrementally add Tuples to a dataset
>>>>>>> (in a while loop).
>>>>>>> Is union() the best operator for this task or is there a more
>>>>>>> performant operator for this task?
>>>>>>> Does union affect the reading of already existing elements, or does it
>>>>>>> just append the new ones somewhere?
>>>>>>>
>>>>>>> Best,
>>>>>>> Flavio
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
