And how many unions would your program use if you followed the
union-in-loop approach?

2015-09-07 18:31 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> On the order of 10 GB.
>
> On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Accumulators can be used to collect records, but they are not designed to
>> hold large amounts of data.
>> It might work up to a certain point (~10MB) and fail beyond that.
>>
>> How many unions do you plan to use in your program?
>>
>>
>>
>> 2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> OK, thanks. Are there any alternatives to that? May I use accumulators for
>>> that?
>>> On 7 Sep 2015 17:47, "Fabian Hueske" <fhue...@gmail.com> wrote:
>>>
>>>> If the loop count of 3 is fixed (or not significantly larger), union
>>>> should be fine.
>>>>
>>>> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>
>>>>> Sorry, the program has a union at:
>>>>> accumulated = accumulated.union(x.filter(t.f1 == 0))
>>>>>
>>>>> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Flavio,
>>>>>>
>>>>>> your example does not contain a union.
>>>>>>
>>>>>> Union itself basically comes for free. However, if you have a lot of
>>>>>> small DataSets that you want to union, the plan can become very complex
>>>>>> and might cause overhead due to scheduling many small tasks. For example,
>>>>>> it is usually better to have one data source and input format that reads
>>>>>> multiple small files instead of adding one data source for each tiny file
>>>>>> and applying union to all data sources to get all data.
>>>>>>
>>>>>> TL;DR: if your iteration count is only 3, as your example suggests, you
>>>>>> should be fine. If it exceeds, say, 32, it might be worth rethinking
>>>>>> your program.
>>>>>>
>>>>>> Cheers, Fabian
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>>
>>>>>>> Hi Stephan,
>>>>>>> thanks for the answer. Unfortunately, I didn't understand whether
>>>>>>> there's an alternative to union right now.
>>>>>>> My process is basically like this:
>>>>>>>
>>>>>>> DataSet x = ...
>>>>>>> while (loopCnt < 3) {
>>>>>>>    x = x.join(y).where(0).equalTo(0).with(...);
>>>>>>>    accumulated = x.filter(t.f1 == 0);
>>>>>>>    x = x.filter(t.f1 != 0);
>>>>>>>    loopCnt++;
>>>>>>> }
>>>>>>>
>>>>>>> Best,
>>>>>>> Flavio
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Union, like all operators, is lazy. When you call union, it only
>>>>>>>> builds a "union stream" that unions when you execute the task. So
>>>>>>>> nothing is added before you call "env.execute()".
>>>>>>>>
>>>>>>>> After you call "env.execute()" and then union again, you will
>>>>>>>> re-execute the entire history of computation to compute the data set
>>>>>>>> that you union with. Hence, for incremental computations, union() is
>>>>>>>> probably not a good choice, unless you persist intermediate data
>>>>>>>> (seamless support for that is WIP).
>>>>>>>>
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <
>>>>>>>> pomperma...@okkam.it> wrote:
>>>>>>>>
>>>>>>>>> Hi to all,
>>>>>>>>> I have a job where I have to incrementally add Tuples to a dataset
>>>>>>>>> (in a while loop).
>>>>>>>>> Is union() the best operator for this task or is there a more
>>>>>>>>> performant operator for this task?
>>>>>>>>> Does union affect the read of already existing elements, or does it
>>>>>>>>> just append the new ones somewhere?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Flavio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>
>
>
