Re: Akka timeout

2015-05-04 Thread Stephan Ewen
Yes, the issue I mentioned is solved in the next release. On Tue, May 5, 2015 at 12:56 AM, Flavio Pompermaier wrote: > Thanks for the support! so this issue will be solved in the next 0.9 > release? > > On Tue, May 5, 2015 at 12:17 AM, Stephan Ewen wrote: > >> Here is a list of all values you c

Re: Akka timeout

2015-05-04 Thread Flavio Pompermaier
Thanks for the support! So this issue will be solved in the next 0.9 release? On Tue, May 5, 2015 at 12:17 AM, Stephan Ewen wrote: > Here is a list of all values you can set: > http://ci.apache.org/projects/flink/flink-docs-master/setup/config.html > > On Tue, May 5, 2015 at 12:17 AM, Stephan Ew

Re: Akka timeout

2015-05-04 Thread Stephan Ewen
Here is a list of all values you can set: http://ci.apache.org/projects/flink/flink-docs-master/setup/config.html On Tue, May 5, 2015 at 12:17 AM, Stephan Ewen wrote: > Hi Flavio! > > This may be a known and fixed issue. It relates to the fact that task > deployment may take long in case of big

Re: Akka timeout

2015-05-04 Thread Stephan Ewen
Hi Flavio! This may be a known and fixed issue. It relates to the fact that task deployment may take a long time in the case of big jar files. The current master should not have this issue any more, but 0.9-SNAPSHOT has it. As a temporary workaround, you can increase "akka.ask.timeout" in the flink configura
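A minimal sketch of that workaround, assuming the standard conf/flink-conf.yaml location; the timeout value below is purely illustrative:

    # conf/flink-conf.yaml
    # Raise the Akka ask timeout used for RPCs such as task deployment.
    # "300 s" is an illustrative choice, not a recommended default.
    akka.ask.timeout: 300 s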

Akka timeout

2015-05-04 Thread Flavio Pompermaier
Hi to all, In my current (local) job I receive a lot of Akka timeout errors during task deployment at: org.apache.flink.runtime.executiongraph.Execution$2.onComplete(Execution.java:342) Is it normal? Which parameter do I have to increase? Best, Flavio

RE: Crash on DataSet.collect()

2015-05-04 Thread Flavio Baronti
Hi Stephan, I confirm that I was using custom types in the collect(), and that the bug is not present in the master. Thanks Flavio From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On Behalf Of Stephan Ewen Sent: Monday, May 04, 2015 2:33 PM To: user@flink.apache.org Subj

Re: filter().project() vs flatMap()

2015-05-04 Thread Fabian Hueske
That might help with cardinality estimation for cost-based optimization, for example when deciding on join strategies (broadcast vs. repartition, or the build side of a hash join). However, as Stephan said, there are many cases where it does not make a difference, e.g. if the input cardinality of the f

Re: Best way to join with inequalities (historical data)

2015-05-04 Thread Stephan Ewen
When you use cross() it is always good to use crossWithHuge or crossWithTiny to tell the system which input is the large one and which is the small one. It cannot always infer that automagically at this point. If you have an optimized structure for the lookups, go with a broadcast variable and a map() function. On Mon, May 4, 201
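A minimal sketch of the size hint, assuming the Flink Java DataSet API; the data set names and sizes are illustrative:

    import org.apache.flink.api.common.functions.CrossFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class CrossHintExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Long> big  = env.generateSequence(1, 1000000); // large side
            DataSet<Long> tiny = env.generateSequence(1, 10);      // small side

            // crossWithTiny marks the argument as the small input, so the
            // system can broadcast it instead of shipping the large side.
            big.crossWithTiny(tiny)
               .with(new CrossFunction<Long, Long, Long>() {
                   @Override
                   public Long cross(Long a, Long b) {
                       return a * b;
                   }
               })
               .print(); // triggers execution in recent Flink versions
        }
    }

crossWithHuge is the mirror image: it marks the argument as the large input.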

Re: filter().project() vs flatMap()

2015-05-04 Thread Sebastian
If the system has to decide data shipping strategies for a join (e.g., broadcasting one side), it helps to have good estimates of the input sizes. On 04.05.2015 14:53, Flavio Pompermaier wrote: Thanks Sebastian and Fabian for the feedback, just one last question: what does change from the system

Re: filter().project() vs flatMap()

2015-05-04 Thread Flavio Pompermaier
Thanks Sebastian and Fabian for the feedback, just one last question: what changes from the system's point of view when it knows that the number of output tuples is <= the number of input tuples? Is there any optimization that Flink can apply to the pipeline? On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske wrot

Re: filter().project() vs flatMap()

2015-05-04 Thread Stephan Ewen
Hah, interesting to see how opinions differ ;-) Sebastian has a point that filter + project is more transparent to the system. In some situations, this knowledge can help the optimizer, but often, it will not matter. Greetings, Stephan On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske wrote: > I

Re: filter().project() vs flatMap()

2015-05-04 Thread Fabian Hueske
It should not make a difference. I think it's just personal taste. If your filter condition is simple enough, I'd go with Flink's Table API because it does not require you to define a Filter or FlatMapFunction. 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier : > Hi Flinkers, > > I'd like to know wheth

Re: filter().project() vs flatMap()

2015-05-04 Thread Sebastian
filter + project is easier for the system to understand, as the number of output tuples is guaranteed to be <= the number of input tuples. With flatMap, the system cannot know an upper bound. --sebastian On 04.05.2015 14:43, Flavio Pompermaier wrote: Hi Flinkers, I'd like to know whether it'
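A sketch of the two equivalent formulations, assuming the Java DataSet API, an assumed ExecutionEnvironment env, and an illustrative Tuple3 input (whether project() can be assigned to a typed DataSet directly varies between Flink versions):

    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.util.Collector;

    // Assumed illustrative input.
    DataSet<Tuple3<Integer, String, Double>> input = env.fromElements(
            new Tuple3<Integer, String, Double>(1, "a", 0.9),
            new Tuple3<Integer, String, Double>(2, "b", 0.1));

    // Variant 1: filter + project. The system sees a Filter (output <= input)
    // followed by a Project, so a cardinality bound is visible to the optimizer.
    DataSet<Tuple2<Integer, Double>> viaFilterProject = input
        .filter(new FilterFunction<Tuple3<Integer, String, Double>>() {
            @Override
            public boolean filter(Tuple3<Integer, String, Double> t) {
                return t.f2 > 0.5;
            }
        })
        .project(0, 2);

    // Variant 2: one flatMap doing both. Functionally identical, but the
    // system cannot bound the number of emitted records.
    DataSet<Tuple2<Integer, Double>> viaFlatMap = input
        .flatMap(new FlatMapFunction<Tuple3<Integer, String, Double>,
                                     Tuple2<Integer, Double>>() {
            @Override
            public void flatMap(Tuple3<Integer, String, Double> t,
                                Collector<Tuple2<Integer, Double>> out) {
                if (t.f2 > 0.5) {
                    out.collect(new Tuple2<Integer, Double>(t.f0, t.f2));
                }
            }
        });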

filter().project() vs flatMap()

2015-05-04 Thread Flavio Pompermaier
Hi Flinkers, I'd like to know whether it's better to perform a filter().project() or a flatMap() to filter tuples and do some projection after the filter. Functionally they are equivalent, but maybe I'm ignoring something... Thanks in advance, Flavio

Re: Crash on DataSet.collect()

2015-05-04 Thread Stephan Ewen
Hi Flavio! This issue is known and has been fixed already. It occurs when you use custom types in collect, because it uses the wrong classloader/serializer to transfer them. The current master should not have this issue any more. Greetings, Stephan On Mon, May 4, 2015 at 2:09 PM, Flavio Baront

Crash on DataSet.collect()

2015-05-04 Thread Flavio Baronti
Hello, I'm testing the new DataSet.collect() method on version 0.9-milestone-1, but I obtain the following error on cluster execution (no problem with local execution), which also causes the job manager to crash: 14:05:41,145 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deployin
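For context, a minimal sketch of the method being tested (illustrative data, not the failing program from this report; collect() ships all elements of a small DataSet back to the client as a List and triggers execution):

    import java.util.List;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<Integer, String>> data = env.fromElements(
            new Tuple2<Integer, String>(1, "a"),
            new Tuple2<Integer, String>(2, "b"));
    // Gathers all elements into a local List; intended for small results.
    List<Tuple2<Integer, String>> result = data.collect();

Per Stephan's reply above, the crash was triggered when the element type was a custom (non-built-in) type.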

RE: Best way to join with inequalities (historical data)

2015-05-04 Thread LINZ, Arnaud
Hi, Thanks. The use case I have right now does not require too much magic; my historical data set is small enough to fit in RAM, so I'll spread it over each node and use a simple mapping with a log(n) lookup. It was more a theoretical question. If my dataset becomes too large, I may use some hashi
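A sketch of that approach, assuming the Java DataSet API, non-overlapping validity intervals, and illustrative types: Tuple3<Long, Long, String> for (start, end, label) intervals in a data set named historical, and Tuple2<String, Long> for (name, timestamp) events in a data set named events:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.configuration.Configuration;

    DataSet<Tuple2<String, String>> joined = events
        .map(new RichMapFunction<Tuple2<String, Long>, Tuple2<String, String>>() {
            private List<Tuple3<Long, Long, String>> sorted;

            @Override
            public void open(Configuration parameters) {
                // Read the broadcast set attached via withBroadcastSet()
                // and sort it once by interval start.
                sorted = new ArrayList<Tuple3<Long, Long, String>>(
                        getRuntimeContext().<Tuple3<Long, Long, String>>
                                getBroadcastVariable("history"));
                Collections.sort(sorted, new Comparator<Tuple3<Long, Long, String>>() {
                    @Override
                    public int compare(Tuple3<Long, Long, String> a,
                                       Tuple3<Long, Long, String> b) {
                        return a.f0.compareTo(b.f0);
                    }
                });
            }

            @Override
            public Tuple2<String, String> map(Tuple2<String, Long> event) {
                // Binary search: last interval whose start <= event timestamp.
                long ts = event.f1;
                int lo = 0, hi = sorted.size() - 1, match = -1;
                while (lo <= hi) {
                    int mid = (lo + hi) >>> 1;
                    if (sorted.get(mid).f0 <= ts) { match = mid; lo = mid + 1; }
                    else { hi = mid - 1; }
                }
                // Emit the label only if the timestamp also lies before the end.
                String label = (match >= 0 && sorted.get(match).f1 >= ts)
                        ? sorted.get(match).f2 : null;
                return new Tuple2<String, String>(event.f0, label);
            }
        })
        .withBroadcastSet(historical, "history");

If intervals overlap, the lookup would have to return all matches (e.g., via a flatMap) instead of at most one.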

Re: Best way to join with inequalities (historical data)

2015-05-04 Thread Matthias J. Sax
Hi, there is no direct system support for expressing this join. However, you could perform some "hand wired" optimization by partitioning your input data into distinct intervals. It might be tricky, though. Especially if the time-ranges in your "range-key" dataset are overlapping everywhere (-> data r
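One way to hand-wire that, sketched under strong assumptions (an illustrative bucket width, the same interval/event tuple types as in the sketch above, and intervals that do not span too many buckets, since each interval is replicated per bucket):

    import org.apache.flink.api.common.functions.FlatJoinFunction;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.*;
    import org.apache.flink.util.Collector;

    final long BUCKET = 3600000L; // 1 hour in millis; purely illustrative

    // Replicate each interval into every bucket it overlaps.
    DataSet<Tuple4<Long, Long, Long, String>> bucketedIntervals = historical
        .flatMap(new FlatMapFunction<Tuple3<Long, Long, String>,
                                     Tuple4<Long, Long, Long, String>>() {
            @Override
            public void flatMap(Tuple3<Long, Long, String> iv,
                                Collector<Tuple4<Long, Long, Long, String>> out) {
                for (long b = iv.f0 / BUCKET; b <= iv.f1 / BUCKET; b++) {
                    out.collect(new Tuple4<Long, Long, Long, String>(b, iv.f0, iv.f1, iv.f2));
                }
            }
        });

    // Tag each event with the single bucket its timestamp falls into.
    DataSet<Tuple3<Long, String, Long>> bucketedEvents = events
        .map(new MapFunction<Tuple2<String, Long>, Tuple3<Long, String, Long>>() {
            @Override
            public Tuple3<Long, String, Long> map(Tuple2<String, Long> e) {
                return new Tuple3<Long, String, Long>(e.f1 / BUCKET, e.f0, e.f1);
            }
        });

    // Equi-join on the bucket, then re-check the real inequality predicate.
    DataSet<Tuple2<String, String>> joined = bucketedEvents
        .join(bucketedIntervals).where(0).equalTo(0)
        .with(new FlatJoinFunction<Tuple3<Long, String, Long>,
                                   Tuple4<Long, Long, Long, String>,
                                   Tuple2<String, String>>() {
            @Override
            public void join(Tuple3<Long, String, Long> e,
                             Tuple4<Long, Long, Long, String> iv,
                             Collector<Tuple2<String, String>> out) {
                if (iv.f1 <= e.f2 && e.f2 <= iv.f2) {
                    out.collect(new Tuple2<String, String>(e.f1, iv.f3));
                }
            }
        });

The replication cost grows with interval length and overlap, which is the data-blowup risk mentioned above.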

Best way to join with inequalities (historical data)

2015-05-04 Thread LINZ, Arnaud
Hello, I was wondering how to join large data sets on inequalities. Let's say I have a data set whose “keys” are two timestamps (start time & end time of validity) and whose value is a label: final DataSet<…> historical = …; I also have events, with an event name and a timestamp: final