3 or 4 usually.

On 7 Sep 2015 18:39, "Fabian Hueske" <fhue...@gmail.com> wrote:
> And how many unions would your program use if you would follow the
> union-in-loop approach?
>
> 2015-09-07 18:31 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> In the order of 10 GB.
>>
>> On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>
>>> Accumulators can be used to collect records, but they are not designed
>>> to hold large amounts of data.
>>> It might work up to a certain point (~10MB) and fail beyond that.
>>>
>>> How many unions do you plan to use in your program?
>>>
>>> 2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>
>>>> OK, thanks. Are there any alternatives to that? May I use accumulators
>>>> for that?
>>>>
>>>> On 7 Sep 2015 17:47, "Fabian Hueske" <fhue...@gmail.com> wrote:
>>>>
>>>>> If the loop count of 3 is fixed (or not significantly larger), union
>>>>> should be fine.
>>>>>
>>>>> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>
>>>>>> Sorry, the program has a union at
>>>>>> accumulated = accumulated.union(x.filter(t.f1 == 0))
>>>>>>
>>>>>> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Flavio,
>>>>>>>
>>>>>>> your example does not contain a union.
>>>>>>>
>>>>>>> Union itself basically comes for free. However, if you have a lot of
>>>>>>> small DataSets that you want to union, the plan can become very
>>>>>>> complex and might cause overhead due to scheduling many small tasks.
>>>>>>> For example, it is usually better to have one data source and input
>>>>>>> format that reads multiple small files instead of adding one data
>>>>>>> source for each tiny file and applying union to all data sources to
>>>>>>> get all data.
>>>>>>>
>>>>>>> TL;DR: if your iteration count is only 3, as your example suggests,
>>>>>>> you should be fine. If it exceeds, say, 32, it might be worth
>>>>>>> thinking about your program.
>>>>>>>
>>>>>>> Cheers, Fabian
>>>>>>>
>>>>>>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>>>
>>>>>>>> Hi Stephan,
>>>>>>>> thanks for the answer. Unfortunately I didn't understand if there's
>>>>>>>> an alternative to union right now.
>>>>>>>> My process is basically like this:
>>>>>>>>
>>>>>>>> Dataset x = ...
>>>>>>>> while(loopCnt < 3){
>>>>>>>>   x = x.join(y).where(0).equalTo(0).with(...);
>>>>>>>>   accumulated = x.filter(t.f1 == 0);
>>>>>>>>   x = x.filter(t.f1 != 0);
>>>>>>>>   loopCnt++;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>>>
>>>>>>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Union, like all operators, is lazy. When you call union, it only
>>>>>>>>> builds a "union stream" that unions when you execute the task. So
>>>>>>>>> nothing is added before you call "env.execute()".
>>>>>>>>>
>>>>>>>>> After you call "env.execute()" and then union again, you will
>>>>>>>>> re-execute the entire history of computation to compute the data
>>>>>>>>> set that you union with. Hence, for incremental computations,
>>>>>>>>> union() is probably not a good choice, unless you persist
>>>>>>>>> intermediate data (seamless support for that is WIP).
>>>>>>>>>
>>>>>>>>> Stephan
>>>>>>>>>
>>>>>>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier
>>>>>>>>> <pomperma...@okkam.it> wrote:
>>>>>>>>>
>>>>>>>>>> Hi to all,
>>>>>>>>>> I have a job where I have to incrementally add Tuples to a
>>>>>>>>>> dataset (in a while loop).
>>>>>>>>>> Is union() the best operator for this task, or is there a more
>>>>>>>>>> performant operator for this task?
>>>>>>>>>> Does union affect the read of already existing elements, or does
>>>>>>>>>> it just append the new ones somewhere?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Flavio
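
For anyone following along, Stephan's point about laziness can be illustrated with a small toy model. This is only a sketch in plain Java and does not use Flink: `LazySet`, `source` and `readCounts` are hypothetical names made up for the illustration, and `collect()` merely stands in for running the job. It assumes, as described in the thread, that building a union only extends the plan and that every later execution recomputes the plan from its sources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Toy model only: LazySet is a hypothetical class, NOT Flink's DataSet.
class LazyUnionSketch {

    /** A lazily evaluated collection: the plan runs only when collect() is called. */
    static final class LazySet {
        private final Supplier<List<Integer>> plan;

        LazySet(Supplier<List<Integer>> plan) {
            this.plan = plan;
        }

        /** Builds a bigger plan; no data is touched here. */
        LazySet union(LazySet other) {
            Supplier<List<Integer>> left = this.plan;
            Supplier<List<Integer>> right = other.plan;
            return new LazySet(() -> {
                List<Integer> out = new ArrayList<>(left.get());
                out.addAll(right.get());
                return out;
            });
        }

        /** Stand-in for executing the job: evaluates the whole plan from the sources. */
        List<Integer> collect() {
            return plan.get();
        }
    }

    /** A source that counts how often it is actually read. */
    static LazySet source(AtomicInteger reads, Integer... values) {
        List<Integer> data = Arrays.asList(values);
        return new LazySet(() -> {
            reads.incrementAndGet();
            return data;
        });
    }

    /** Returns the source-read counter after plan building, first run, and second run. */
    static int[] readCounts() {
        AtomicInteger reads = new AtomicInteger();
        LazySet accumulated = source(reads, 1, 2).union(source(reads, 3));
        int afterBuild = reads.get();        // plan built, nothing executed yet

        accumulated.collect();               // first execution reads both sources
        int afterFirstRun = reads.get();

        accumulated = accumulated.union(source(reads, 4));
        accumulated.collect();               // second execution re-reads the old sources too
        int afterSecondRun = reads.get();

        return new int[] {afterBuild, afterFirstRun, afterSecondRun};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(readCounts()));
    }
}
```

In this model the counter stays at zero while the plan is being built, and the second run reads the two original sources again on top of the new one. That is the re-execution cost Stephan mentions for union after an execute, and why a fixed, small number of unions inside a single plan (as Fabian suggests) is the cheaper pattern.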