Re: Design documents for consolidated DataStream API

2015-07-14 Thread Gyula Fóra
I think Marton has some good points here. 1) Is KeyedDataStream a better name if this is only a renaming? 2) The discretize semantics are unclear indeed. Are we operating on a single dataset or a sequence of datasets? If the latter, why not call it something else (dstream)? How are joins and other binary ope

Re: Design documents for consolidated DataStream API

2015-07-14 Thread Stephan Ewen
Concerning your comments: 1) In the new design, there is no grouping without windowing. The KeyedDataStream subsumes the grouping and key-ing for partitioned state. The keyBy() + window() makes a parallel grouped window; keyBy() alone allows access to partitioned state. My thought was
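
The distinction drawn above between keyBy() alone and keyBy() + window() can be sketched in plain Java. This is only an illustration of the semantics under discussion, not Flink code: the class KeyedStreamSketch, the Event type, and both method names are hypothetical, and tumbling windows are assumed purely to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Flink.
final class KeyedStreamSketch {

    /** A stream element: a key, a value, and a timestamp in milliseconds. */
    static final class Event {
        final String key;
        final int value;
        final long timestamp;

        Event(String key, int value, long timestamp) {
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /**
     * keyBy() alone: no groups are built. Each element updates the state
     * stored under its key and the updated running sum is emitted at once.
     */
    static List<Integer> runningSumPerKey(List<Event> stream) {
        Map<String, Integer> state = new LinkedHashMap<>();
        List<Integer> emitted = new ArrayList<>();
        for (Event e : stream) {
            emitted.add(state.merge(e.key, e.value, Integer::sum));
        }
        return emitted;
    }

    /**
     * keyBy() + window(): elements are grouped per key inside tumbling
     * windows (an assumption of this sketch); one sum per window and key.
     */
    static Map<String, Integer> windowedSumPerKey(List<Event> stream, long windowSizeMs) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Event e : stream) {
            long windowStart = e.timestamp - (e.timestamp % windowSizeMs);
            sums.merge(windowStart + "/" + e.key, e.value, Integer::sum);
        }
        return sums;
    }
}
```

In the first method every input element produces an output and nothing is ever grouped; in the second, a result is only meaningful per window, which is the sense in which grouping needs a window.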

Re: Design documents for consolidated DataStream API

2015-07-14 Thread Gyula Fóra
If we only want to have either keyBy or groupBy, why not keep groupBy? That would be more consistent with the batch api. On Tue, Jul 14, 2015 at 10:35 AM Stephan Ewen wrote: > Concerning your comments: > > 1) In the new design, there is no grouping without windowing. The > KeyedDataStream subsume

Re: Design documents for consolidated DataStream API

2015-07-14 Thread Stephan Ewen
keyBy() does not do any grouping. Grouping in streams is not defined without windows. On Tue, Jul 14, 2015 at 10:48 AM, Gyula Fóra wrote: > If we only want to have either keyBy or groupBy, why not keep groupBy? That > would be more consistent with the batch api. > On Tue, Jul 14, 2015 at 10:35 A

Re: Design documents for consolidated DataStream API

2015-07-14 Thread Stephan Ewen
It is now a bit different than the batch API, because streaming semantics are a bit different ;-) One good thing is that we can make things better that were sub-optimal in the Batch API. On Tue, Jul 14, 2015 at 10:55 AM, Stephan Ewen wrote: > keyBy() does not do any grouping. Grouping in stream

Re: Design documents for consolidated DataStream API

2015-07-14 Thread Aljoscha Krettek
I agree, the groupBy in the batch API is misleading, since a ds.groupBy().reduce() does not really build any groups; it is really a ds.keyBy().reduceByKey(). In the streaming API we can still fix this, IMHO. On Tue, 14 Jul 2015 at 10:56 Stephan Ewen wrote: > It is not a bit different than the b
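
The point that a ds.groupBy().reduce() never builds groups can be made concrete with a small plain-Java sketch. It does not use the Flink API; ReduceByKeySketch and its two methods are hypothetical names chosen for this illustration. One method materializes real groups, the other keeps a single accumulator per key, which is what a keyed reduce amounts to.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical illustration only; not part of any Flink API.
final class ReduceByKeySketch {

    /** Builds real groups: every element is retained under its key. */
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> input) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : input) {
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return groups;
    }

    /** Keyed reduce: one accumulator per key, no group is ever stored. */
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> input,
                                            BinaryOperator<Integer> reducer) {
        Map<String, Integer> accumulators = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : input) {
            accumulators.merge(e.getKey(), e.getValue(), reducer);
        }
        return accumulators;
    }
}
```

The second method needs memory proportional to the number of keys rather than the number of elements, which is why the name reduceByKey describes the operation more accurately than groupBy followed by reduce.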

Re: Design documents for consolidated DataStream API

2015-07-14 Thread Gyula Fóra
I see your point, reduceByKey is much clearer. The question is whether we want to introduce this inconsistency across the two APIs or stick with what we have. On Tue, Jul 14, 2015 at 10:57 AM Aljoscha Krettek wrote: > I agree, the groupBy, in the batch API is misleading, since a > ds.groupBy().

Re: Design documents for consolidated DataStream API

2015-07-14 Thread Kostas Tzoumas
I think the thought was to explicitly not have the same terminology as the batch API to not confuse people. But this is a minor naming issue IMO. On Tue, Jul 14, 2015 at 12:40 PM, Gyula Fóra wrote: > I see your point, reduceByKey is much clearer. > > The question is whether we want to introduce

Re: Design documents for consolidated DataStream API

2015-07-14 Thread Stephan Ewen
There is no inconsistency between the Batch and Streaming API. They have different semantics - the batch API is implicitly always windowed. There is a naming difference between the two APIs. There is a strong inconsistency within the Streaming API right now. Grouping and aggregating without windo

[jira] [Created] (FLINK-2354) Recover running jobs on JobManager failure

2015-07-14 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-2354: -- Summary: Recover running jobs on JobManager failure Key: FLINK-2354 URL: https://issues.apache.org/jira/browse/FLINK-2354 Project: Flink Issue Type: Sub-task

[jira] [Created] (FLINK-2355) Job hanging in collector, waiting for request buffer

2015-07-14 Thread William Saar (JIRA)
William Saar created FLINK-2355: --- Summary: Job hanging in collector, waiting for request buffer Key: FLINK-2355 URL: https://issues.apache.org/jira/browse/FLINK-2355 Project: Flink Issue Type:

[jira] [Created] (FLINK-2356) Resource leak in checkpoint coordinator

2015-07-14 Thread Ufuk Celebi (JIRA)
Ufuk Celebi created FLINK-2356: -- Summary: Resource leak in checkpoint coordinator Key: FLINK-2356 URL: https://issues.apache.org/jira/browse/FLINK-2356 Project: Flink Issue Type: Bug C

[jira] [Created] (FLINK-2357) New JobManager Runtime Web Frontend

2015-07-14 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2357: --- Summary: New JobManager Runtime Web Frontend Key: FLINK-2357 URL: https://issues.apache.org/jira/browse/FLINK-2357 Project: Flink Issue Type: New Feature

[jira] [Created] (FLINK-2358) Add Netty-HTTP based server and server handlers

2015-07-14 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2358: --- Summary: Add Netty-HTTP based server and server handlers Key: FLINK-2358 URL: https://issues.apache.org/jira/browse/FLINK-2358 Project: Flink Issue Type: Sub-t

[jira] [Created] (FLINK-2359) Add factory methods to the Java TupleX types

2015-07-14 Thread Gabor Gevay (JIRA)
Gabor Gevay created FLINK-2359: -- Summary: Add factory methods to the Java TupleX types Key: FLINK-2359 URL: https://issues.apache.org/jira/browse/FLINK-2359 Project: Flink Issue Type: Improvemen
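
As a rough idea of what a factory method on a tuple type buys, here is a minimal sketch. The issue text above is truncated, so the class Tuple2Sketch and its of() method are assumptions made for illustration and are not taken from the issue or from Flink's source.

```java
// Hypothetical stand-in for a two-field tuple type; not Flink's class.
final class Tuple2Sketch<T0, T1> {
    final T0 f0;
    final T1 f1;

    Tuple2Sketch(T0 f0, T1 f1) {
        this.f0 = f0;
        this.f1 = f1;
    }

    /** Static factory: the type arguments are inferred from the arguments. */
    static <T0, T1> Tuple2Sketch<T0, T1> of(T0 f0, T1 f1) {
        return new Tuple2Sketch<>(f0, f1);
    }
}
```

A call such as Tuple2Sketch.of("a", 1) then replaces new Tuple2Sketch<String, Integer>("a", 1), which mainly helps where the diamond operator cannot be used, for example when the tuple is passed directly as a method argument on older Java versions.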

[jira] [Created] (FLINK-2360) EOFException

2015-07-14 Thread Andra Lungu (JIRA)
Andra Lungu created FLINK-2360: -- Summary: EOFException Key: FLINK-2360 URL: https://issues.apache.org/jira/browse/FLINK-2360 Project: Flink Issue Type: Bug Components: Local Runtime

[jira] [Created] (FLINK-2361) flatMap + distict gives eroneous results for big data sets

2015-07-14 Thread Andra Lungu (JIRA)
Andra Lungu created FLINK-2361: -- Summary: flatMap + distict gives eroneous results for big data sets Key: FLINK-2361 URL: https://issues.apache.org/jira/browse/FLINK-2361 Project: Flink Issue Ty

[jira] [Created] (FLINK-2362) distinct is missing in DataSet API documentation

2015-07-14 Thread Fabian Hueske (JIRA)
Fabian Hueske created FLINK-2362: Summary: distinct is missing in DataSet API documentation Key: FLINK-2362 URL: https://issues.apache.org/jira/browse/FLINK-2362 Project: Flink Issue Type: Bu

Re: Student looking to contribute to Stratosphere

2015-07-14 Thread Rohit Shinde
Hi, Sorry for the brief hiatus. I was preparing for my GRE exam, but I am back. I am starting to build Flink, and one doubt I had was whether a single-node cluster configuration of Hadoop is enough. I assume Hadoop is needed since it is given on the build page. On Sat, Jun 27, 2015 at 8:02 PM, Chiwan

Re: Student looking to contribute to Stratosphere

2015-07-14 Thread Márton Balassi
Hi, Hadoop is not a necessity for running Flink, but rather an option. Try the steps of the setup guide. [1] If you really need HDFS, though, to get the best IO performance, I would suggest having Hadoop on all your machines running Flink. [1] https://ci.apache.org/projects/flink/flink-docs-release-0

Re: Design documents for consolidated DataStream API

2015-07-14 Thread Márton Balassi
Ok, thanks for the clarification. Let us try to document it in a way that those thoughts are reflected then. Discretization will not happen upfront; we can wait with that. On Tue, Jul 14, 2015 at 12:49 PM, Stephan Ewen wrote: > There is no inconsistency between the Batch and Streaming API. They h