And how many unions would your program use if you followed the union-in-loop approach?
2015-09-07 18:31 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> On the order of 10 GB.
>
> On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Accumulators can be used to collect records, but they are not designed
>> to hold large amounts of data.
>> It might work up to a certain point (~10MB) and fail beyond that.
>>
>> How many unions do you plan to use in your program?
>>
>> 2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> OK, thanks. Are there any alternatives to that? May I use accumulators
>>> for that?
>>>
>>> On 7 Sep 2015 17:47, "Fabian Hueske" <fhue...@gmail.com> wrote:
>>>
>>>> If the loop count of 3 is fixed (or not significantly larger), union
>>>> should be fine.
>>>>
>>>> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>
>>>>> Sorry, the program has a union at
>>>>> accumulated = accumulated.union(x.filter(t.f1 == 0))
>>>>>
>>>>> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Flavio,
>>>>>>
>>>>>> your example does not contain a union.
>>>>>>
>>>>>> Union itself basically comes for free. However, if you have a lot
>>>>>> of small DataSets that you want to union, the plan can become very
>>>>>> complex and might cause overhead due to scheduling many small
>>>>>> tasks. For example, it is usually better to have one data source
>>>>>> and input format that reads multiple small files instead of adding
>>>>>> one data source for each tiny file and applying union to all data
>>>>>> sources to get all data.
>>>>>>
>>>>>> TL;DR: if your iteration count is only 3, as your example suggests,
>>>>>> you should be fine. If it exceeds, say, 32, it might be worth
>>>>>> rethinking your program.
>>>>>>
>>>>>> Cheers, Fabian
>>>>>>
>>>>>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>>
>>>>>>> Hi Stephan,
>>>>>>> thanks for the answer. Unfortunately I didn't understand whether
>>>>>>> there's an alternative to union right now.
>>>>>>> My process is basically like this:
>>>>>>>
>>>>>>> DataSet x = ...
>>>>>>> while (loopCnt < 3) {
>>>>>>>   x = x.join(y).where(0).equalTo(0).with(...);
>>>>>>>   accumulated = x.filter(t.f1 == 0);
>>>>>>>   x = x.filter(t.f1 != 0);
>>>>>>>   loopCnt++;
>>>>>>> }
>>>>>>>
>>>>>>> Best,
>>>>>>> Flavio
>>>>>>>
>>>>>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Union, like all operators, is lazy. When you call union, it only
>>>>>>>> builds a "union stream" that performs the union when the task is
>>>>>>>> executed. So nothing is added before you call "env.execute()".
>>>>>>>>
>>>>>>>> If you call "env.execute()" and then union again, you will
>>>>>>>> re-execute the entire history of computation to compute the data
>>>>>>>> set that you union with. Hence, for incremental computations,
>>>>>>>> union() is probably not a good choice, unless you persist
>>>>>>>> intermediate data (seamless support for that is WIP).
>>>>>>>>
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <
>>>>>>>> pomperma...@okkam.it> wrote:
>>>>>>>>
>>>>>>>>> Hi to all,
>>>>>>>>> I have a job where I have to incrementally add Tuples to a
>>>>>>>>> dataset (in a while loop).
>>>>>>>>> Is union() the best operator for this task or is there a more
>>>>>>>>> performant operator for this task?
>>>>>>>>> Does union affect the read of already existing elements or does
>>>>>>>>> it just append the new ones somewhere?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Flavio
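
For reference, the loop discussed in this thread (including the corrected line accumulated = accumulated.union(x.filter(t.f1 == 0))) can be modeled in plain Java. This is only a rough in-memory sketch and does not use the Flink API: lists stand in for DataSets, a map stands in for y, and the class name, the method name, and the join function (decrement f1 by the matching value in y, floored at 0) are all hypothetical, since the thread leaves the body of .with(...) unspecified. In Flink, each union call would add a node to the lazily built plan rather than append to a list, which is why the number of loop iterations matters there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class UnionLoopSketch {

    /**
     * In-memory model of the loop from the thread. Each tuple is an int[]{f0, f1}.
     * The join function used here is an assumption made for illustration only.
     */
    static List<int[]> accumulate(List<int[]> x, Map<Integer, Integer> y, int maxLoops) {
        List<int[]> accumulated = new ArrayList<>();
        List<int[]> current = new ArrayList<>(x);
        for (int loopCnt = 0; loopCnt < maxLoops; loopCnt++) {
            List<int[]> remaining = new ArrayList<>();
            for (int[] t : current) {
                Integer step = y.get(t[0]);
                if (step == null) {
                    continue; // inner join on field 0: tuples without a match in y are dropped
                }
                int f1 = Math.max(0, t[1] - step); // hypothetical join function
                if (f1 == 0) {
                    // models accumulated = accumulated.union(x.filter(t.f1 == 0))
                    accumulated.add(new int[] {t[0], f1});
                } else {
                    // models x = x.filter(t.f1 != 0)
                    remaining.add(new int[] {t[0], f1});
                }
            }
            current = remaining;
        }
        return accumulated;
    }

    public static void main(String[] args) {
        List<int[]> x = List.of(new int[] {1, 1}, new int[] {2, 2}, new int[] {3, 5});
        Map<Integer, Integer> y = Map.of(1, 1, 2, 1, 3, 1);
        for (int[] t : accumulate(x, y, 3)) {
            System.out.println("(" + t[0] + ", " + t[1] + ")");
        }
    }
}
```

With a fixed loop count of 3 this corresponds to three unions in the plan, which is the case described as fine above; the sketch says nothing about scheduling overhead or re-execution after env.execute(), both of which only arise in the real runtime.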