If the loop count of 3 is fixed (or not significantly larger), union should be fine.
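
To make the plan-size argument concrete, here is a minimal sketch in plain Java that models the loop from the thread as a graph of operator nodes. It is only a toy model: `UnionPlanSketch`, `Node`, `buildPlan`, and `countNodes` are hypothetical names for illustration and are not part of Flink's API. The sketch counts plan nodes and does not process any data.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Toy model only: these classes are hypothetical and are NOT Flink's API.
class UnionPlanSketch {

    // A plan node: a source (no inputs) or an operator over earlier nodes.
    static final class Node {
        final String name;
        final List<Node> inputs;

        Node(String name, Node... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }
    }

    // Builds the plan for the loop discussed in the thread (join, filter,
    // union into "accumulated") and returns the final accumulated node.
    static Node buildPlan(int loops) {
        if (loops < 1) {
            throw new IllegalArgumentException("loops must be >= 1");
        }
        Node x = new Node("source-x");
        Node y = new Node("source-y");
        Node accumulated = null;
        for (int i = 0; i < loops; i++) {
            x = new Node("join-" + i, x, y);
            Node finished = new Node("filter-eq-" + i, x);
            accumulated = (accumulated == null)
                    ? finished
                    : new Node("union-" + i, accumulated, finished);
            x = new Node("filter-ne-" + i, x);
        }
        return accumulated;
    }

    // Counts the distinct nodes reachable from the given root node.
    static int countNodes(Node root) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (seen.add(n)) {
                for (Node in : n.inputs) {
                    stack.push(in);
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("nodes for 3 loops:  " + countNodes(buildPlan(3)));
        System.out.println("nodes for 32 loops: " + countNodes(buildPlan(32)));
    }
}
```

In this toy model the node count grows linearly with the loop count, so a small fixed count keeps the plan small while a much larger count produces a correspondingly larger plan to schedule.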

2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> Sorry, the program has a union at
> accumulated = accumulated.union(x.filter(t.f1 == 0))
>
> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi Flavio,
>>
>> your example does not contain a union.
>>
>> Union itself basically comes for free. However, if you have a lot of
>> small DataSets that you want to union, the plan can become very complex
>> and might cause overhead due to scheduling many small tasks. For example,
>> it is usually better to have one data source and input format that reads
>> multiple small files instead of adding one data source for each tiny file
>> and applying union to all data sources to get all data.
>>
>> TL;DR: if your iteration count is only 3, as your example suggests, you
>> should be fine. If it exceeds, say, 32, it might be worth thinking about
>> your program.
>>
>> Cheers, Fabian
>>
>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> Hi Stephan,
>>> thanks for the answer. Unfortunately I didn't understand whether there's
>>> an alternative to union right now.
>>> My process is basically like this:
>>>
>>> Dataset x = ...
>>> while(loopCnt < 3){
>>>   x = x.join(y).where(0).equalTo(0).with());
>>>   accumulated = x.filter(t.f1 == 0);
>>>   x = x.filter(t.f1 != 0);
>>>   loopCnt++;
>>> }
>>>
>>> Best,
>>> Flavio
>>>
>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> Union, like all operators, is lazy. When you call union, it only
>>>> builds a "union stream" that unions when you execute the task. So
>>>> nothing is added before you call "env.execute()".
>>>>
>>>> After you call "env.execute()" and then union again, you will
>>>> re-execute the entire history of computation to compute the data set
>>>> that you union with. Hence, for incremental computations, union() is
>>>> probably not a good choice, unless you persist intermediate data
>>>> (seamless support for that is WIP).
>>>>
>>>> Stephan
>>>>
>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier
>>>> <pomperma...@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>> I have a job where I have to incrementally add Tuples to a dataset
>>>>> (in a while loop).
>>>>> Is union() the best operator for this task, or is there a more
>>>>> performant operator for it?
>>>>> Does union affect the read of already existing elements, or does it
>>>>> just append the new ones somewhere?
>>>>>
>>>>> Best,
>>>>> Flavio
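
Stephan's point about laziness, quoted above, can be illustrated with a small plain-Java sketch. It is a toy model under stated assumptions, not Flink code: `LazyUnionSketch`, `LazyData`, `source`, and the `sourceReads` counter are hypothetical names introduced here. `union()` only records work, and each `execute()` re-runs the whole recorded history from the sources because nothing is persisted in between.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: LazyData is a hypothetical class and is NOT Flink's DataSet.
class LazyUnionSketch {

    // Counts how many times any source was actually read.
    static int sourceReads = 0;

    static final class LazyData {
        private final Supplier<List<Integer>> plan;

        LazyData(Supplier<List<Integer>> plan) {
            this.plan = plan;
        }

        // union() only records what to do; no data is read or copied here.
        LazyData union(LazyData other) {
            return new LazyData(() -> {
                List<Integer> out = new ArrayList<>(plan.get());
                out.addAll(other.plan.get());
                return out;
            });
        }

        // execute() runs the entire recorded history, starting at the sources.
        List<Integer> execute() {
            return plan.get();
        }
    }

    // Creates a source whose reads are counted in sourceReads.
    static LazyData source(Integer... values) {
        List<Integer> data = Arrays.asList(values);
        return new LazyData(() -> {
            sourceReads++;
            return data;
        });
    }

    public static void main(String[] args) {
        LazyData a = source(1, 2);
        LazyData ab = a.union(source(3));
        System.out.println("reads after building the plan: " + sourceReads);

        ab.execute();
        System.out.println("reads after first execute:     " + sourceReads);

        // Union again after executing: the earlier sources are read a second time.
        LazyData abc = ab.union(source(4));
        abc.execute();
        System.out.println("reads after second execute:    " + sourceReads);
    }
}
```

The second `execute()` reads the first two sources again in addition to the new one, which is the re-execution cost described above for incremental use of union across separate executions.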