Re: Union/append performance question

2015-09-07 Thread Fabian Hueske
In that case you should go with union.

2015-09-07 19:06 GMT+02:00 Flavio Pompermaier:

> 3 or 4 usually..
> On 7 Sep 2015 18:39, "Fabian Hueske" wrote:
>
>> And how many unions would your program use if you would follow the
>> union-in-loop approach?
>>
>> 2015-09-07 18:31 GMT+02:00 Flavio Pompe

Re: Union/append performance question

2015-09-07 Thread Flavio Pompermaier
3 or 4 usually..

On 7 Sep 2015 18:39, "Fabian Hueske" wrote:

> And how many unions would your program use if you would follow the
> union-in-loop approach?
>
> 2015-09-07 18:31 GMT+02:00 Flavio Pompermaier:
>
>> In the order of 10 GB..
>>
>> On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske wrote:

Re: Union/append performance question

2015-09-07 Thread Fabian Hueske
And how many unions would your program use if you would follow the union-in-loop approach?

2015-09-07 18:31 GMT+02:00 Flavio Pompermaier:

> In the order of 10 GB..
>
> On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske wrote:
>
>> Accumulators can be used to collect records, but they are not designe

Re: Union/append performance question

2015-09-07 Thread Flavio Pompermaier
In the order of 10 GB..

On Mon, Sep 7, 2015 at 6:14 PM, Fabian Hueske wrote:

> Accumulators can be used to collect records, but they are not designed to
> hold large amounts of data.
> It might work up to a certain point (~10MB) and fail beyond that.
>
> How many unions do you plan to use in you

Re: Union/append performance question

2015-09-07 Thread Fabian Hueske
Accumulators can be used to collect records, but they are not designed to hold large amounts of data.
It might work up to a certain point (~10MB) and fail beyond that.

How many unions do you plan to use in your program?

2015-09-07 17:58 GMT+02:00 Flavio Pompermaier:

> ok thanks. are there any

Re: Union/append performance question

2015-09-07 Thread Flavio Pompermaier
OK, thanks. Are there any alternatives to that? May I use accumulators for that?

On 7 Sep 2015 17:47, "Fabian Hueske" wrote:

> If the loop count of 3 is fixed (or not significantly larger), union
> should be fine.
>
> 2015-09-07 17:07 GMT+02:00 Flavio Pompermaier:
>
>> Sorry the program has a unio

Re: Union/append performance question

2015-09-07 Thread Fabian Hueske
If the loop count of 3 is fixed (or not significantly larger), union should be fine.

2015-09-07 17:07 GMT+02:00 Flavio Pompermaier:

> Sorry the program has a union at
> accumulated = accumulated.union(x.filter(t.f1 == 0))
>
> On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske wrote:
>
>> Hi Fla

Re: Union/append performance question

2015-09-07 Thread Flavio Pompermaier
Sorry, the program has a union at

    accumulated = accumulated.union(x.filter(t.f1 == 0))

On Mon, Sep 7, 2015 at 4:58 PM, Fabian Hueske wrote:

> Hi Flavio,
>
> your example does not contain a union.
>
> Union itself basically comes for free. However, if you have a lot of small
> DataSet that you w
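A minimal sketch of the loop shape described in this message, written with plain Java lists rather than Flink's DataSet API so that it runs without any dependency. The class `UnionLoopSketch` and its methods `step` and `run` are hypothetical names, and `step` is only a stand-in for the join whose function is not shown in the thread (here it simply decrements `f1`). Each tuple is modelled as an `int[]` holding `{f0, f1}`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the loop from the thread; lists stand in for Flink DataSets.
final class UnionLoopSketch {

    // Hypothetical stand-in for the join step, which is elided in the thread:
    // it just decrements f1 (index 1) of every tuple.
    static List<int[]> step(List<int[]> x) {
        List<int[]> out = new ArrayList<>();
        for (int[] t : x) {
            out.add(new int[] {t[0], t[1] - 1});
        }
        return out;
    }

    // Runs a fixed number of rounds and collects every tuple whose f1 reaches 0.
    static List<int[]> run(List<int[]> input, int rounds) {
        List<int[]> accumulated = new ArrayList<>();
        List<int[]> x = input;
        for (int loopCnt = 0; loopCnt < rounds; loopCnt++) {
            x = step(x);
            List<int[]> remaining = new ArrayList<>();
            for (int[] t : x) {
                if (t[1] == 0) {
                    accumulated.add(t); // models accumulated.union(x.filter(f1 == 0))
                } else {
                    remaining.add(t);   // models x = x.filter(f1 != 0)
                }
            }
            x = remaining;
        }
        return accumulated;
    }

    public static void main(String[] args) {
        List<int[]> input = Arrays.asList(
                new int[] {1, 1}, new int[] {2, 3}, new int[] {3, 2}, new int[] {4, 5});
        for (int[] t : run(input, 3)) {
            System.out.println(t[0] + "," + t[1]);
        }
    }
}
```

With a fixed count of three rounds, the model performs three append steps, which corresponds to the small, bounded number of unions discussed in the replies.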

Re: Union/append performance question

2015-09-07 Thread Fabian Hueske
Hi Flavio,

your example does not contain a union.

Union itself basically comes for free. However, if you have a lot of small DataSets that you want to union, the plan can become very complex and might cause overhead due to scheduling many small tasks. For example, it is usually better to have one

Re: Union/append performance question

2015-09-07 Thread Flavio Pompermaier
Hi Stephan,

thanks for the answer. Unfortunately I didn't understand whether there's an alternative to union right now.. My process is basically like this:

    Dataset x = ...
    while (loopCnt < 3) {
        x = x.join(y).where(0).equalTo(0).with(...);
        accumulated = x.filter(t.f1 == 0);
        x = x.filter(t.f1 != 0);

Re: Union/append performance question

2015-09-07 Thread Stephan Ewen
Union, like all operators, is lazy. When you call union, it only builds a "union stream" that unions when you execute the task. So nothing is added before you call "env.execute()".

After you call "env.execute()" and then union again, you will re-execute the entire history of computation to compute
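The laziness described in this message can be illustrated with a small plain-Java model. This is a sketch only: `LazyData`, `source`, `union`, `execute`, and the `sourceReads` counter are hypothetical names chosen for illustration and are not Flink's API. The model assumes that nothing is cached between executions, which is how the message describes the behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a lazily evaluated dataset; this is NOT Flink's DataSet class.
final class LazyData {
    // Counts how often any source is actually read.
    static int sourceReads = 0;

    private final Supplier<List<Integer>> plan;

    private LazyData(Supplier<List<Integer>> plan) {
        this.plan = plan;
    }

    // Wraps a list as a source; the data is only copied when the plan runs.
    static LazyData source(List<Integer> values) {
        return new LazyData(() -> {
            sourceReads++;
            return new ArrayList<>(values);
        });
    }

    // Only records the union step; no records are read or copied here.
    LazyData union(LazyData other) {
        return new LazyData(() -> {
            List<Integer> out = plan.get();
            out.addAll(other.plan.get());
            return out;
        });
    }

    // Stand-in for triggering execution: runs the whole recorded plan from its sources.
    List<Integer> execute() {
        return plan.get();
    }
}

final class LazyUnionDemo {
    public static void main(String[] args) {
        LazyData a = LazyData.source(Arrays.asList(1, 2));
        LazyData b = LazyData.source(Arrays.asList(3));

        LazyData u = a.union(b);
        System.out.println(LazyData.sourceReads); // 0: building the union read nothing

        System.out.println(u.execute());          // [1, 2, 3]
        System.out.println(LazyData.sourceReads); // 2: both sources were read once

        LazyData u2 = u.union(LazyData.source(Arrays.asList(4)));
        System.out.println(u2.execute());         // [1, 2, 3, 4]
        System.out.println(LazyData.sourceReads); // 5: a and b were read again, plus the new source
    }
}
```

In this model the second execution re-reads the earlier sources, which mirrors the point that unioning again after an execution recomputes the earlier history instead of appending to a stored result.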

Union/append performance question

2015-09-07 Thread Flavio Pompermaier
Hi to all,

I have a job where I have to incrementally add Tuples to a dataset (in a while loop). Is union() the best operator for this task, or is there a more performant one? Does union affect the read of already existing elements, or does it just append the new ones somewhere?

Best,