Re: Union/append performance question

Flavio Pompermaier Mon, 07 Sep 2015 08:09:04 -0700

Sorry, the program does have a union, at:
accumulated = accumulated.union(x.filter(t.f1 == 0))
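
To make the intended loop concrete, here is a rough plain-Java model of it (deliberately not Flink API code). The int[]{f0, f1} pairs stand in for tuples, and joinStep is a purely hypothetical placeholder for the real join, whose function is not shown in this thread. Unlike Flink, this model evaluates eagerly, so it only illustrates the data flow, not the execution plan.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the loop; int[]{f0, f1} stands in for a two-field tuple.
final class AccumulateLoopSketch {

    // Hypothetical stand-in for x.join(y).where(0).equalTo(0).with(...):
    // it just decrements f1 so that some tuples reach f1 == 0 in each round.
    static List<int[]> joinStep(List<int[]> x) {
        List<int[]> out = new ArrayList<>();
        for (int[] t : x) {
            out.add(new int[] {t[0], Math.max(0, t[1] - 1)});
        }
        return out;
    }

    static List<int[]> run(List<int[]> x, int maxLoops) {
        List<int[]> accumulated = new ArrayList<>();
        for (int loopCnt = 0; loopCnt < maxLoops; loopCnt++) {
            x = joinStep(x);
            List<int[]> remaining = new ArrayList<>();
            for (int[] t : x) {
                if (t[1] == 0) {
                    accumulated.add(t); // accumulated = accumulated.union(x.filter(t.f1 == 0))
                } else {
                    remaining.add(t);   // x = x.filter(t.f1 != 0)
                }
            }
            x = remaining;
        }
        return accumulated;
    }
}
```

Calling AccumulateLoopSketch.run(input, 3) returns every tuple whose f1 reached 0 within three rounds, which is what the union is meant to collect.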


On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi Flavio,
>
> your example does not contain a union.
>
> Union itself basically comes for free. However, if you have a lot of small
> DataSets that you want to union, the plan can become very complex and might
> cause overhead due to scheduling many small tasks. For example, it is
> usually better to have one data source and input format that reads multiple
> small files instead of adding one data source for each tiny file and
> applying union to all data sources to get all data.
>
> TL;DR: if your iteration count is only 3, as your example suggests, you
> should be fine. If it exceeds, say, 32, it might be worth thinking about
> your program.
>
> Cheers, Fabian
>
>
>
> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> Hi Stephan,
>> thanks for the answer. Unfortunately I didn't understand whether there's
>> an alternative to union right now.
>> My process is basically like this:
>>
>> DataSet x = ...
>> while(loopCnt < 3){
>>    x = x.join(y).where(0).equalTo(0).with(...);
>>    accumulated = x.filter(t.f1 == 0);
>>    x =  x.filter(t.f1!=0);
>>    loopCnt++;
>> }
>>
>> Best,
>> Flavio
>>
>>
>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org> wrote:
>>
>>> Union, like all operators, is lazy. When you call union, it only builds
>>> a "union stream" that performs the union when you execute the task. So
>>> nothing is added before you call "env.execute()".
>>>
>>> After you call "env.execute()" and then union again, you will re-execute
>>> the entire history of computation to compute the data set that you union
>>> with. Hence, for incremental computations, union() is probably not a good
>>> choice, unless you persist intermediate data (seamless support for that is
>>> WIP).
>>>
>>> Stephan
>>>
>>>
>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>
>>>> Hi to all,
>>>> I have a job where I have to incrementally add Tuples to a dataset (in
>>>> a while loop).
>>>> Is union() the best operator for this task or is there a more
>>>> performant operator for this task?
>>>> Does union affect the read of already existing elements, or does it just
>>>> append the new ones somewhere?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>>
>>>>
>>>
>>
>
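
Stephan's point quoted above (union is lazy, and a union made after an execute re-runs the whole history) can be modeled in plain Java. This is only an analogy built on java.util.function.Supplier, not Flink's API: get() plays the role of env.execute(), and sourceRuns is a counter I added to make the recomputation visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Analogy only: a Supplier is a deferred plan, and get() stands in for env.execute().
final class LazyUnionSketch {
    static int sourceRuns = 0; // counts how often the "source" is actually computed

    static Supplier<List<Integer>> source() {
        return () -> {
            sourceRuns++;
            return Arrays.asList(1, 2, 3);
        };
    }

    // Builds a deferred union; neither input is evaluated until get() is called.
    static Supplier<List<Integer>> union(Supplier<List<Integer>> a,
                                         Supplier<List<Integer>> b) {
        return () -> {
            List<Integer> out = new ArrayList<>(a.get());
            out.addAll(b.get());
            return out;
        };
    }
}
```

Building a union leaves sourceRuns untouched. Each get() on a plan that contains the source evaluates the source again, which mirrors the re-execution cost described for incremental unions when intermediate data is not persisted.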
