On the order of 10 GB.

On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske <fhue...@gmail.com> wrote:
> Accumulators can be used to collect records, but they are not designed to
> hold large amounts of data.
> It might work up to a certain point (~10 MB) and fail beyond that.
>
> How many unions do you plan to use in your program?
>
> 2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> Ok, thanks. Are there any alternatives to that? May I use accumulators
>> for that?
>>
>> On 7 Sep 2015 17:47, "Fabian Hueske" <fhue...@gmail.com> wrote:
>>
>>> If the loop count of 3 is fixed (or not significantly larger), union
>>> should be fine.
>>>
>>> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>
>>>> Sorry, the program has a union at
>>>> accumulated = accumulated.union(x.filter(t.f1 == 0))
>>>>
>>>> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Flavio,
>>>>>
>>>>> your example does not contain a union.
>>>>>
>>>>> Union itself basically comes for free. However, if you have a lot of
>>>>> small DataSets that you want to union, the plan can become very
>>>>> complex and might cause overhead due to scheduling many small tasks.
>>>>> For example, it is usually better to have one data source and input
>>>>> format that reads multiple small files instead of adding one data
>>>>> source for each tiny file and applying union to all data sources to
>>>>> get all data.
>>>>>
>>>>> TL;DR: if your iteration count is only 3, as your example suggests,
>>>>> you should be fine. If it exceeds, say, 32, it might be worth
>>>>> rethinking your program.
>>>>>
>>>>> Cheers, Fabian
>>>>>
>>>>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>>
>>>>>> Hi Stephan,
>>>>>> thanks for the answer. Unfortunately I didn't understand whether
>>>>>> there's an alternative to union right now.
>>>>>> My process is basically like this:
>>>>>>
>>>>>> Dataset x = ...
>>>>>> while (loopCnt < 3) {
>>>>>>     x = x.join(y).where(0).equalTo(0).with(...);
>>>>>>     accumulated = x.filter(t.f1 == 0);
>>>>>>     x = x.filter(t.f1 != 0);
>>>>>>     loopCnt++;
>>>>>> }
>>>>>>
>>>>>> Best,
>>>>>> Flavio
>>>>>>
>>>>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Union, like all operators, is lazy. When you call union, it only
>>>>>>> builds a "union stream" that is unioned when you execute the task.
>>>>>>> So nothing is added before you call "env.execute()".
>>>>>>>
>>>>>>> If you call "env.execute()" and then union again, you will
>>>>>>> re-execute the entire history of computation to compute the data
>>>>>>> set that you union with. Hence, for incremental computations,
>>>>>>> union() is probably not a good choice, unless you persist
>>>>>>> intermediate data (seamless support for that is WIP).
>>>>>>>
>>>>>>> Stephan
>>>>>>>
>>>>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <
>>>>>>> pomperma...@okkam.it> wrote:
>>>>>>>
>>>>>>>> Hi to all,
>>>>>>>> I have a job where I have to incrementally add Tuples to a dataset
>>>>>>>> (in a while loop).
>>>>>>>> Is union() the best operator for this task, or is there a more
>>>>>>>> performant operator for it?
>>>>>>>> Does union affect the read of already existing elements, or does
>>>>>>>> it just append the new ones somewhere?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Flavio
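
For readers of the archive, Stephan's point about lazy union and re-execution can be illustrated with a small toy model. This is only a sketch in plain Java and does not use the Flink API: `LazyUnionSketch`, `LazySet`, `source()` and `sourceReads` are hypothetical names invented for the illustration, and the hard-coded list stands in for a real data source. It assumes that a data set is just a recipe that is re-run on every execution, with no persisted intermediate results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Toy model of a lazily evaluated data set. This is NOT the Flink API. */
public class LazyUnionSketch {

    /** A plan node: it stores a recipe for producing records, not the records. */
    static final class LazySet<T> {
        private final Supplier<List<T>> recipe;

        LazySet(Supplier<List<T>> recipe) {
            this.recipe = recipe;
        }

        /** Extends the plan with a filter; nothing is computed here. */
        LazySet<T> filter(Predicate<T> p) {
            return new LazySet<>(() -> {
                List<T> out = new ArrayList<>();
                for (T t : recipe.get()) {
                    if (p.test(t)) {
                        out.add(t);
                    }
                }
                return out;
            });
        }

        /** Extends the plan with a union; nothing is computed here either. */
        LazySet<T> union(LazySet<T> other) {
            return new LazySet<>(() -> {
                List<T> out = new ArrayList<>(recipe.get());
                out.addAll(other.recipe.get());
                return out;
            });
        }

        /** Runs the whole plan from the sources, every time it is called. */
        List<T> execute() {
            return recipe.get();
        }
    }

    /** Counts how often the (hypothetical) source is actually read. */
    static final AtomicInteger sourceReads = new AtomicInteger();

    /** Hypothetical source with fixed sample records. */
    static LazySet<Integer> source() {
        return new LazySet<>(() -> {
            sourceReads.incrementAndGet();
            return List.of(0, 1, 2, 0, 3);
        });
    }

    public static void main(String[] args) {
        LazySet<Integer> x = source();
        LazySet<Integer> accumulated = x.filter(v -> v == 0);
        System.out.println("reads after building the plan: " + sourceReads.get());

        System.out.println("first execute: " + accumulated.execute());

        // Union after an execution: the earlier result is not reused.
        accumulated = accumulated.union(x.filter(v -> v == 3));
        System.out.println("second execute: " + accumulated.execute());
        System.out.println("total source reads: " + sourceReads.get());
    }
}
```

In this model the source is not read while the plan is being built, once for the first execution, and twice more for the second, because both branches of the union recompute their input from scratch. That is the behaviour Stephan describes for calling union again after "env.execute()" without persisting intermediate data.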