Sorry, the program does have a union, at: accumulated = accumulated.union(x.filter(t.f1 == 0));
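To make the intended behaviour concrete, here is a minimal plain-Java model of the loop with the union in place. It is only a sketch and does not use the Flink API: tuples are modelled as `int[] {f0, f1}`, lists stand in for DataSets, and `step` is a hypothetical stand-in for the join-with step (which is not spelled out in this thread) that simply lowers `f1` by one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnionLoopSketch {
    /** Hypothetical stand-in for the join-with step: lowers f1 by one, never below 0. */
    static int[] step(int[] t) {
        return new int[] { t[0], Math.max(0, t[1] - 1) };
    }

    /** Runs the loop and returns the accumulated tuples (those whose f1 reached 0). */
    static List<int[]> accumulate(List<int[]> input, int maxLoops) {
        List<int[]> x = new ArrayList<>(input);
        List<int[]> accumulated = new ArrayList<>();
        for (int loopCnt = 0; loopCnt < maxLoops; loopCnt++) {
            // x = x.join(y)...with(...) in the original program (assumed behaviour)
            List<int[]> joined = new ArrayList<>();
            for (int[] t : x) {
                joined.add(step(t));
            }
            List<int[]> rest = new ArrayList<>();
            for (int[] t : joined) {
                if (t[1] == 0) {
                    accumulated.add(t); // accumulated = accumulated.union(x.filter(f1 == 0))
                } else {
                    rest.add(t);        // x = x.filter(f1 != 0)
                }
            }
            x = rest;
        }
        return accumulated;
    }

    public static void main(String[] args) {
        List<int[]> input = Arrays.asList(
            new int[] {1, 1}, new int[] {2, 2}, new int[] {3, 3}, new int[] {4, 5});
        for (int[] t : accumulate(input, 3)) {
            System.out.println(t[0] + "," + t[1]);
        }
    }
}
```

With this sample input, tuples 1, 2 and 3 end up in `accumulated` after three iterations, while tuple 4 is still in `x`.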
On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com> wrote:
> Hi Flavio,
>
> your example does not contain a union.
>
> Union itself basically comes for free. However, if you have a lot of small
> DataSets that you want to union, the plan can become very complex and might
> cause overhead due to scheduling many small tasks. For example, it is
> usually better to have one data source and input format that reads multiple
> small files instead of adding one data source for each tiny file and
> applying union to all data sources to get all data.
>
> TL;DR: if your iteration count is only 3, as your example suggests, you
> should be fine. If it exceeds, say, 32, it might be worth thinking about
> your program.
>
> Cheers, Fabian
>
> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> Hi Stephan,
>> thanks for the answer. Unfortunately I didn't understand whether there's
>> an alternative to union right now.
>> My process is basically like this:
>>
>> Dataset x = ...
>> while(loopCnt < 3){
>>   x = x.join(y).where(0).equalTo(0).with(...);
>>   accumulated = x.filter(t.f1 == 0);
>>   x = x.filter(t.f1 != 0);
>>   loopCnt++;
>> }
>>
>> Best,
>> Flavio
>>
>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org> wrote:
>>
>>> Union, like all operators, is lazy. When you call union, it only builds
>>> a "union stream" that unions when you execute the task. So nothing is
>>> added before you call "env.execute()".
>>>
>>> After you call "env.execute()" and then union again, you will re-execute
>>> the entire history of computation to compute the data set that you union
>>> with. Hence, for incremental computations, union() is probably not a good
>>> choice, unless you persist intermediate data (seamless support for that is
>>> WIP).
>>>
>>> Stephan
>>>
>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>
>>>> Hi to all,
>>>> I have a job where I have to incrementally add Tuples to a dataset (in
>>>> a while loop).
>>>> Is union() the best operator for this task, or is there a more
>>>> performant operator?
>>>> Does union affect the read of already existing elements, or does it just
>>>> append the new ones somewhere?
>>>>
>>>> Best,
>>>> Flavio
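Stephan's point about laziness and re-execution can be illustrated with a toy model in plain Java. This is a sketch under stated assumptions, not Flink code: `LazyDataSet`, `source` and `sourceReads` are hypothetical names, a `Supplier` stands in for the execution plan, and `execute()` plays the role of "env.execute()". Calling `union` only extends the plan, and every `execute()` re-runs the whole plan from the sources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class LazyUnionSketch {
    /** Counts how many times any (hypothetical) source is actually read. */
    static final AtomicInteger sourceReads = new AtomicInteger();

    /** A lazy "data set": nothing is computed until execute() is called. */
    static final class LazyDataSet {
        private final Supplier<List<Integer>> plan;

        LazyDataSet(Supplier<List<Integer>> plan) {
            this.plan = plan;
        }

        /** Only extends the plan; no data is touched here. */
        LazyDataSet union(LazyDataSet other) {
            return new LazyDataSet(() -> {
                List<Integer> out = new ArrayList<>(plan.get());
                out.addAll(other.plan.get());
                return out;
            });
        }

        /** Runs the whole plan from scratch; nothing is cached between calls. */
        List<Integer> execute() {
            return plan.get();
        }
    }

    /** A source that records every read, so re-execution becomes visible. */
    static LazyDataSet source(int... values) {
        return new LazyDataSet(() -> {
            sourceReads.incrementAndGet();
            List<Integer> out = new ArrayList<>();
            for (int v : values) {
                out.add(v);
            }
            return out;
        });
    }

    public static void main(String[] args) {
        LazyDataSet acc = source(1, 2).union(source(3));
        System.out.println("reads after union:   " + sourceReads.get());
        System.out.println("first execute:       " + acc.execute());
        System.out.println("reads after execute: " + sourceReads.get());

        acc = acc.union(source(4));
        System.out.println("second execute:      " + acc.execute());
        System.out.println("reads in total:      " + sourceReads.get());
    }
}
```

In this model the second `execute()` reads the first two sources again in addition to the new one, which is the cost Stephan describes for incremental unions when intermediate data is not persisted.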