OK, thanks. Are there any alternatives to that? May I use accumulators for that?

On 7 Sep 2015 17:47, "Fabian Hueske" <fhue...@gmail.com> wrote:
> If the loop count of 3 is fixed (or not significantly larger), union
> should be fine.
>
> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> Sorry, the program has a union at:
>>
>> accumulated = accumulated.union(x.filter(t.f1 == 0))
>>
>> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>
>>> Hi Flavio,
>>>
>>> your example does not contain a union.
>>>
>>> Union itself basically comes for free. However, if you have a lot of
>>> small DataSets that you want to union, the plan can become very complex
>>> and might cause overhead due to scheduling many small tasks. For example,
>>> it is usually better to have one data source and input format that reads
>>> multiple small files than to add one data source for each tiny file and
>>> apply union to all data sources to get all the data.
>>>
>>> TL;DR: if your iteration count is only 3, as your example suggests, you
>>> should be fine. If it exceeds, say, 32, it might be worth rethinking your
>>> program.
>>>
>>> Cheers, Fabian
>>>
>>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>
>>>> Hi Stephan,
>>>> thanks for the answer. Unfortunately, I didn't understand whether
>>>> there's an alternative to union right now.
>>>> My process is basically like this:
>>>>
>>>> DataSet x = ...
>>>> while (loopCnt < 3) {
>>>>   x = x.join(y).where(0).equalTo(0).with(...);
>>>>   accumulated = x.filter(t.f1 == 0);
>>>>   x = x.filter(t.f1 != 0);
>>>>   loopCnt++;
>>>> }
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>
>>>>> Union, like all operators, is lazy. When you call union, it only
>>>>> builds a "union stream" that performs the union when you execute the
>>>>> task. So nothing is added before you call "env.execute()".
>>>>>
>>>>> If you call "env.execute()" and then union again, you will
>>>>> re-execute the entire history of computation to compute the data set
>>>>> that you union with. Hence, for incremental computations, union() is
>>>>> probably not a good choice, unless you persist intermediate data
>>>>> (seamless support for that is WIP).
>>>>>
>>>>> Stephan
>>>>>
>>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <
>>>>> pomperma...@okkam.it> wrote:
>>>>>
>>>>>> Hi to all,
>>>>>> I have a job where I have to incrementally add Tuples to a dataset
>>>>>> (in a while loop).
>>>>>> Is union() the best operator for this task, or is there a more
>>>>>> performant operator for it?
>>>>>> Does union affect the read of already existing elements, or does it
>>>>>> just append the new ones somewhere?
>>>>>>
>>>>>> Best,
>>>>>> Flavio
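
To make Stephan's point about laziness and re-execution concrete, here is a minimal sketch in plain Java. It is a toy model only and does not use Flink: `LazyDataSet`, `fromSource`, and the read counters are hypothetical names invented for this illustration, and `execute()` here merely stands in for triggering a job. Unlike a real optimizer, the toy shares nothing between branches of a plan; it is only meant to show that building the plan reads no data and that a second execution recomputes the earlier history.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Toy model of a lazily evaluated data set. This is NOT Flink's DataSet API;
// every name here is hypothetical and exists only to illustrate lazy plans.
final class LazyDataSet<T> {
    private final Supplier<List<T>> plan;

    private LazyDataSet(Supplier<List<T>> plan) {
        this.plan = plan;
    }

    // A "source" that counts how often it is actually read.
    static <T> LazyDataSet<T> fromSource(List<T> data, int[] readCounter) {
        return new LazyDataSet<>(() -> {
            readCounter[0]++;
            return new ArrayList<>(data);
        });
    }

    // Only records the filter in the plan; no data is touched here.
    LazyDataSet<T> filter(Predicate<T> predicate) {
        return new LazyDataSet<>(() -> {
            List<T> out = new ArrayList<>();
            for (T t : plan.get()) {
                if (predicate.test(t)) {
                    out.add(t);
                }
            }
            return out;
        });
    }

    // Only records the union in the plan; nothing is appended yet.
    LazyDataSet<T> union(LazyDataSet<T> other) {
        return new LazyDataSet<>(() -> {
            List<T> out = new ArrayList<>(plan.get());
            out.addAll(other.plan.get());
            return out;
        });
    }

    // Stand-in for triggering execution: evaluates the whole plan from its sources.
    List<T> execute() {
        return plan.get();
    }
}

final class LazyUnionSketch {
    public static void main(String[] args) {
        int[] xReads = {0};
        int[] yReads = {0};
        LazyDataSet<Integer> x = LazyDataSet.fromSource(Arrays.asList(0, 1, 2, 3), xReads);
        LazyDataSet<Integer> y = LazyDataSet.fromSource(Arrays.asList(10, 11), yReads);

        // Building the plan does not read any source.
        LazyDataSet<Integer> accumulated = x.filter(v -> v % 2 == 0);
        System.out.println("x reads after building the plan: " + xReads[0]);

        // The first execution reads x once.
        System.out.println("first result: " + accumulated.execute());
        System.out.println("x reads after first execution: " + xReads[0]);

        // Union after an execution: the next execution recomputes the x branch too.
        accumulated = accumulated.union(y);
        System.out.println("second result: " + accumulated.execute());
        System.out.println("x reads after second execution: " + xReads[0]);
    }
}
```

In this toy, the second execution reads `x` again even though its filtered result was already computed once, which mirrors the "re-execute the entire history" behaviour described above. How a real job behaves depends on the engine and on whether intermediate data is persisted.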