Accumulators can be used to collect records, but they are not designed to hold large amounts of data. It might work up to a certain point (~10MB) and fail beyond that.
How many unions do you plan to use in your program?

2015-09-07 17:58 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> ok thanks. are there any alternatives to that? may I use accumulators for
> that?
> On 7 Sep 2015 17:47, "Fabian Hueske" <fhue...@gmail.com> wrote:
>
>> If the loop count of 3 is fixed (or not significantly larger), union
>> should be fine.
>>
>> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>
>>> Sorry the program has a union at
>>> accumulated = accumulated.union(x.filter(t.f1 == 0))
>>>
>>> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>>> Hi Flavio,
>>>>
>>>> your example does not contain a union.
>>>>
>>>> Union itself basically comes for free. However, if you have a lot of
>>>> small DataSets that you want to union, the plan can become very complex and
>>>> might cause overhead due to scheduling many small tasks. For example, it is
>>>> usually better to have one data source and input format that reads multiple
>>>> small files instead of adding one data source for each tiny file and applying
>>>> union to all data sources to get all data.
>>>>
>>>> TL;DR: if your iteration count is only 3 as your example suggests you
>>>> should be fine. If it exceeds say 32 it might be worth thinking about your
>>>> program.
>>>>
>>>> Cheers, Fabian
>>>>
>>>> 2015-09-07 16:29 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>>
>>>>> Hi Stephan,
>>>>> thanks for the answer. Unfortunately I didn't understand if there's an
>>>>> alternative to union right now..
>>>>> My process is basically like this:
>>>>>
>>>>> Dataset x = ...
>>>>> while(loopCnt < 3){
>>>>>   x = x.join(y).where(0).equalTo(0).with());
>>>>>   accumulated = x.filter(t.f1 == 0);
>>>>>   x = x.filter(t.f1!=0);
>>>>>   loopCnt++;
>>>>> }
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>> On Mon, Sep 7, 2015 at 3:15 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>
>>>>>> Union, like all operators, is lazy. When you call union, it only
>>>>>> builds a "union stream", that unions when you execute the task. So
>>>>>> nothing is added before you call "env.execute()"
>>>>>>
>>>>>> After you call "env.execute()" and then union again, you will
>>>>>> re-execute the entire history of computation to compute the data set that
>>>>>> you union with. Hence, for incremental computations, union() is probably
>>>>>> not a good choice, unless you persist intermediate data (seamless support
>>>>>> for that is WIP).
>>>>>>
>>>>>> Stephan
>>>>>>
>>>>>> On Mon, Sep 7, 2015 at 2:56 PM, Flavio Pompermaier <
>>>>>> pomperma...@okkam.it> wrote:
>>>>>>
>>>>>>> Hi to all,
>>>>>>> I have a job where I have to incrementally add Tuples to a dataset
>>>>>>> (in a while loop).
>>>>>>> Is union() the best operator for this task or is there a more
>>>>>>> performant operator for this task?
>>>>>>> Does union affect the read of already existing elements or it just
>>>>>>> appends the new ones somewhere?
>>>>>>>
>>>>>>> Best,
>>>>>>> Flavio
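
To illustrate Stephan's point about lazy union quoted above, here is a rough sketch in plain Java. It does not use the Flink API: the `Lazy` class, `source()` and the `sourceReads` counter are hypothetical stand-ins, and the model assumes that no intermediate result is persisted between executions. It only shows that calling union builds a plan without computing anything, and that executing again after another union recomputes the whole history.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyUnionSketch {

    // Hypothetical stand-in for a lazy data set (NOT a Flink class):
    // it stores a plan and computes nothing until collect() is called.
    public static final class Lazy {
        private final Supplier<List<Integer>> plan;

        Lazy(Supplier<List<Integer>> plan) {
            this.plan = plan;
        }

        // Builds a bigger plan; neither input is evaluated here.
        public Lazy union(Lazy other) {
            return new Lazy(() -> {
                List<Integer> out = new ArrayList<>(plan.get());
                out.addAll(other.plan.get());
                return out;
            });
        }

        // Plays the role of executing the job: evaluates the whole plan.
        public List<Integer> collect() {
            return plan.get();
        }
    }

    // Counts how many times any source is actually read.
    public static final AtomicInteger sourceReads = new AtomicInteger();

    // Hypothetical single-record source that records each read.
    public static Lazy source(int value) {
        return new Lazy(() -> {
            sourceReads.incrementAndGet();
            return List.of(value);
        });
    }

    public static void main(String[] args) {
        Lazy accumulated = source(0);
        for (int i = 1; i <= 3; i++) {
            accumulated = accumulated.union(source(i));
        }
        System.out.println(sourceReads.get());     // prints 0: nothing has run yet
        System.out.println(accumulated.collect()); // prints [0, 1, 2, 3]
        System.out.println(sourceReads.get());     // prints 4: each source read once

        // Union again after the first execution: the next execution
        // re-reads all four earlier sources plus the new one.
        accumulated = accumulated.union(source(4));
        accumulated.collect();
        System.out.println(sourceReads.get());     // prints 9 (4 + 5)
    }
}
```

With a fixed loop count of 3 the recomputation in this toy model is small, which matches the advice in the thread; the cost grows when union and execute alternate many times.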