[jira] [Created] (FLINK-2333) Stream Data Sink that periodically rolls files

2015-07-09 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2333: --- Summary: Stream Data Sink that periodically rolls files Key: FLINK-2333 URL: https://issues.apache.org/jira/browse/FLINK-2333 Project: Flink Issue Type: New F

[jira] [Created] (FLINK-2334) IOException: Channel to path could not be opened

2015-07-09 Thread David Heller (JIRA)
David Heller created FLINK-2334: --- Summary: IOException: Channel to path could not be opened Key: FLINK-2334 URL: https://issues.apache.org/jira/browse/FLINK-2334 Project: Flink Issue Type: Bug

[jira] [Created] (FLINK-2335) Rework iteration construction in StreamGraph

2015-07-09 Thread Gyula Fora (JIRA)
Gyula Fora created FLINK-2335: - Summary: Rework iteration construction in StreamGraph Key: FLINK-2335 URL: https://issues.apache.org/jira/browse/FLINK-2335 Project: Flink Issue Type: Improvement

[jira] [Created] (FLINK-2336) ArrayIndexOutOfBoundsException in TypeExtractor when mapping

2015-07-09 Thread William Saar (JIRA)
William Saar created FLINK-2336: --- Summary: ArrayIndexOutOfBoundsException in TypeExtractor when mapping Key: FLINK-2336 URL: https://issues.apache.org/jira/browse/FLINK-2336 Project: Flink Issu

[jira] [Created] (FLINK-2337) Multiple SLF4J bindings using Storm compatibility layer

2015-07-09 Thread Matthias J. Sax (JIRA)
Matthias J. Sax created FLINK-2337: -- Summary: Multiple SLF4J bindings using Storm compatibility layer Key: FLINK-2337 URL: https://issues.apache.org/jira/browse/FLINK-2337 Project: Flink Iss

[jira] [Created] (FLINK-2338) Shut down "Storm Topologies" cleanly

2015-07-09 Thread Matthias J. Sax (JIRA)
Matthias J. Sax created FLINK-2338: -- Summary: Shut down "Storm Topologies" cleanly Key: FLINK-2338 URL: https://issues.apache.org/jira/browse/FLINK-2338 Project: Flink Issue Type: Improvemen

Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread 马国维
Hi everyone, the docs say Flink Streaming uses "Barriers" to ensure "exactly once." Does the DataSet job use the same mechanism to ensure "exactly once" if a map task fails? Thanks

Re: Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread Kostas Tzoumas
No, it doesn't; periodic snapshots are not needed in DataSet programs, as DataSets are of finite size and failed partitions can be replayed completely. On Thu, Jul 9, 2015 at 2:43 PM, 马国维 wrote: > Hi everyone, the docs say Flink Streaming uses "Barriers" to ensure > "exactly once." Does the DataSe

Re: Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread Márton Balassi
As Kostas mentioned, the failure mechanisms for streaming and batch processing are different, but you can expect exactly-once processing guarantees from both of them. On Thu, Jul 9, 2015 at 2:43 PM, 马国维 wrote: > Hi everyone, the docs say Flink Streaming uses "Barriers" to ensure > "exactly once."

RE: Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread 马国维
The DataExchangeMode is pipelined. If two operators exchange data in pipelined mode, a failed partition has already sent some data to the receiver before it failed. So does replaying all the failed partitions cause some duplicate records? > Date: Thu, 9 Jul 2015 14:47:29 +0200 > Subject: Re: Doe

Re: Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread Stephan Ewen
Currently, Flink restarts the entire job upon failure. There is WIP that restricts this to all tasks involved in the pipeline of the failed task. Let's say we have pipelined MapReduce. If a mapper fails, the reducers that have received some data already have to be restarted as well. In that case

[jira] [Created] (FLINK-2339) Prevent asynchronous checkpoint calls from overtaking each other

2015-07-09 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2339: --- Summary: Prevent asynchronous checkpoint calls from overtaking each other Key: FLINK-2339 URL: https://issues.apache.org/jira/browse/FLINK-2339 Project: Flink

RE: Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread 马国维
DataSet result = in.rebalance().map(new Mapper()); In this case, does the 'map' receive all the data before it begins to work? And if not, will a failed rebalance operator cause some duplicate records? > Date: Thu, 9 Jul 2015 15:40:18 +0200 > Subject: Re: Does
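To make the snippet above concrete: below is a minimal, self-contained sketch of the kind of job being discussed. The original mail uses the Java DataSet API; the sketch uses the Scala DataSet API, and the input data, element types, and mapper logic are invented purely for illustration.

    import org.apache.flink.api.scala._

    object RebalanceMapSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Hypothetical input; the original mail does not say what `in` contains.
        val in: DataSet[String] = env.fromElements("alpha", "beta", "gamma", "delta")

        // rebalance() redistributes the records round-robin across the parallel
        // mapper instances. Because the exchange is pipelined, each mapper starts
        // working as soon as its first records arrive; it does not wait for the
        // complete input.
        val result: DataSet[Int] = in
          .rebalance()
          .map(_.length)

        // collect() defines a sink and triggers execution. On a failure the whole
        // job is currently restarted and the data replayed, so the final result
        // contains no duplicates.
        println(result.collect())
      }
    }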

Re: Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread Maximilian Michels
In pipelined execution, the mapper will start once it receives data. In batch-only execution, the mapper will start once it has received all data. In either case, there won't be any duplicate records. If an error occurs, the entire job will be restarted. As Stephan mentioned, we will soon have a per-t

Re: Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread Stephan Ewen
Any operator in a batch job will receive all of its elements in one complete successful run. The mapper starts its work immediately. On a failure, a fresh mapper is used, and all of the data is replayed. You can think of it as if there was only a single checkpoint at the very beginning (before any

Re: Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread 马国维
Thank you! I see. Thanks to all of you. Sent from my iPhone > On Jul 9, 2015, at 10:48 PM, Stephan Ewen wrote: > > Any operator in a batch job will receive all of its elements in one > complete successful run. > > The mapper starts its work immediately. On a failure, a fresh mapper is > used, and all of the data is re

[jira] [Created] (FLINK-2340) Provide standalone mode for web interface of the JobManager

2015-07-09 Thread Maximilian Michels (JIRA)
Maximilian Michels created FLINK-2340: - Summary: Provide standalone mode for web interface of the JobManager Key: FLINK-2340 URL: https://issues.apache.org/jira/browse/FLINK-2340 Project: Flink

Re: Does a DataSet job also use Barriers to ensure "exactly once"?

2015-07-09 Thread Stephan Ewen
BTW: Duplicates on interaction with the outside world cannot be avoided in the general case. If any program (batch or streaming) inserts data into some outside system (for example a database), then that outside system needs to cooperate to prevent duplicates. - Either, the system could eliminate d
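As a toy illustration of the "eliminate duplicates by key" idea Stephan mentions (plain Scala, no Flink API; the record type and store are invented): if the external system keys each write by a deterministic record id, replaying the same records after a failure overwrites earlier writes instead of appending duplicates.

    import scala.collection.mutable

    // Hypothetical record with a stable, deterministic key.
    final case class Event(id: Long, payload: String)

    // Stand-in for an external store that upserts by primary key.
    object KeyedStore {
      private val table = mutable.Map.empty[Long, String]

      // put() is idempotent per key: a replayed record overwrites the previous
      // value instead of adding a second row.
      def upsert(e: Event): Unit = table.put(e.id, e.payload)

      def size: Int = table.size
    }

    object DedupByKeyDemo extends App {
      val batch = Seq(Event(1, "a"), Event(2, "b"), Event(3, "c"))

      batch.take(2).foreach(KeyedStore.upsert)   // first attempt fails after two writes
      batch.foreach(KeyedStore.upsert)           // the restart replays everything

      println(KeyedStore.size)                   // 3 -- no duplicates
    }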

[jira] [Created] (FLINK-2341) Deadlock in SpilledSubpartitionViewAsyncIO

2015-07-09 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2341: --- Summary: Deadlock in SpilledSubpartitionViewAsyncIO Key: FLINK-2341 URL: https://issues.apache.org/jira/browse/FLINK-2341 Project: Flink Issue Type: Bug

Re: Building several models in parallel

2015-07-09 Thread Felix Neutatz
The problem is that I am not able to incorporate this into a Flink iteration, since MultipleLinearRegression() already contains an iteration; this would therefore be a nested iteration ;-) So at the moment I just use a for loop like this: data: DataSet[model_ID, DataVector] for (current_model

Re: Building several models in parallel

2015-07-09 Thread Till Rohrmann
It's perfectly fine. That way you will construct a data flow graph which will calculate the linear model for all different groups once you define an output and trigger the execution. On Thu, Jul 9, 2015 at 6:05 PM, Felix Neutatz wrote: > The problem is that I am not able to incorporate this into
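A sketch of the loop pattern discussed above, written against the FlinkML Scala API as it existed around the 0.9 release (method names such as setIterations/setStepsize, the LabeledVector input type, and weightsOption should be double-checked against the FlinkML version in use); the group ids, sample data, and output path are invented for illustration. Till's point is visible in the structure: the for loop only adds operators to one data flow graph, and nothing runs until the sinks are defined and execute() is called.

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.LabeledVector
    import org.apache.flink.ml.math.DenseVector
    import org.apache.flink.ml.regression.MultipleLinearRegression

    object PerGroupModels {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Hypothetical input: (model id, labeled training point) pairs.
        val data: DataSet[(Int, LabeledVector)] = env.fromElements(
          (0, LabeledVector(1.0, DenseVector(1.0, 2.0))),
          (0, LabeledVector(2.0, DenseVector(2.0, 3.0))),
          (1, LabeledVector(3.0, DenseVector(3.0, 4.0))),
          (1, LabeledVector(4.0, DenseVector(4.0, 5.0))))
        val modelIds = Seq(0, 1)

        // The loop only *builds* the graph: one filter plus one regression
        // (with its own internal iteration) per model id.
        val models = for (currentModel <- modelIds) yield {
          val groupData = data.filter(_._1 == currentModel).map(_._2)

          val mlr = MultipleLinearRegression()
            .setIterations(10)
            .setStepsize(0.5)

          mlr.fit(groupData)
          currentModel -> mlr
        }

        // Defining sinks for the fitted weights and calling execute() runs all
        // per-group regressions as a single job.
        for ((id, mlr) <- models; weights <- mlr.weightsOption)
          weights.writeAsText(s"/tmp/weights-model-$id")   // hypothetical output path
        env.execute("fit one linear model per group")
      }
    }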

serialization issue

2015-07-09 Thread Felix Neutatz
Hi, I want to use t-digest by Ted Dunning ( https://github.com/tdunning/t-digest/blob/master/src/main/java/com/tdunning/math/stats/ArrayDigest.java) on Flink. Locally that works perfectly. But on the cluster I get the following error: java.lang.Exception: Call to registerInputOutput() of invokab
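The digest cuts the stack trace off here, so the actual cause is unknown; purely as a general illustration, one common pattern when a third-party helper (such as a digest/sketch structure) is not Java-serializable is to mark the field transient and create the object in open() of a rich function on the worker, instead of building it in the driver and shipping it inside the closure. The helper class below is hypothetical and merely stands in for whatever object fails to serialize.

    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.api.scala._
    import org.apache.flink.configuration.Configuration

    // Hypothetical stand-in for a third-party, non-serializable helper.
    class NonSerializableHelper {
      private var sum = 0.0
      def add(x: Double): Unit = { sum += x }
      def result: Double = sum
    }

    class HelperMapper extends RichMapFunction[Double, Double] {
      // Not shipped with the function: marked transient and created per task.
      @transient private var helper: NonSerializableHelper = _

      override def open(parameters: Configuration): Unit = {
        // Instantiated on the worker after deployment, so the helper itself
        // never needs to be serialized.
        helper = new NonSerializableHelper
      }

      override def map(value: Double): Double = {
        helper.add(value)
        helper.result
      }
    }

    object HelperJob {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val result = env.fromElements(1.0, 2.5, 4.0).map(new HelperMapper)
        println(result.collect())
      }
    }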

Re: How do network transmissions in Flink work?

2015-07-09 Thread Niklas Semmler
Hi Stephan, thanks for the input :). I have some further questions on how the data is segmented before/when it is moved over the network: 1. What does the number of ResultSubpartition instances in the ResultPartition correspond to? Is one assigned to each consuming task? If so, how can I fi