Re: filter().project() vs flatMap()

2015-05-04 Thread Fabian Hueske
That might help with cardinality estimation for cost-based optimization. For example when deciding about join strategies (broadcast vs. repartition, build-side of a hash join). However, as Stephan said, there are many cases where it does not make a difference, e.g. if the input cardinality of the f

Re: filter().project() vs flatMap()

2015-05-04 Thread Sebastian
If the system has to decide data shipping strategies for a join (e.g., broadcasting one side) it helps to have good estimates of the input sizes. On 04.05.2015 14:53, Flavio Pompermaier wrote: Thanks Sebastian and Fabian for the feedback, just one last question: what does change from the system

Re: filter().project() vs flatMap()

2015-05-04 Thread Flavio Pompermaier
Thanks Sebastian and Fabian for the feedback, just one last question: what does change from the system point of view to know that the output tuples is <= the number of input tuples? Is there any optimization that Flink can apply to the pipeline? On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske wrot

Re: filter().project() vs flatMap()

2015-05-04 Thread Stephan Ewen
Hah, interesting to see how opinions differ ;-) Sebastian has a point, that filter + project is more transparent to the system. In some situations, this knowledge can help the optimizer, but often, it will not matter. Greetings, Stephan On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske wrote: > I

Re: filter().project() vs flatMap()

2015-05-04 Thread Fabian Hueske
It should not make a difference. I think its just personal taste. If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction. 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier : > Hi Flinkers, > > I'd like to know wheth

Re: filter().project() vs flatMap()

2015-05-04 Thread Sebastian
filter + project is easier to understand for the system, as the number of output tuples is guaranteed to be <= the number of input tuples. With flatMap, the system cannot know an upper bound. --sebastian On 04.05.2015 14:43, Flavio Pompermaier wrote: Hi Flinkers, I'd like to know whether it'

filter().project() vs flatMap()

2015-05-04 Thread Flavio Pompermaier
Hi Flinkers, I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something.. Thanks in advance, Flavio