Re: YARN High Availability

2015-11-18 Thread Till Rohrmann
Hi Gwenhaël, good to hear that you could resolve the problem. When you run multiple HA Flink jobs in the same cluster, you don’t have to adjust Flink’s configuration. It should work out of the box. However, if you run multiple HA Flink clusters, then you have to set for each cluster a d
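
For reference, separating multiple HA clusters is typically done by giving each its own ZooKeeper root path. A sketch, assuming Flink 0.10-era configuration keys (`recovery.zookeeper.path.root`); host names and paths below are illustrative:

```yaml
# flink-conf.yaml for cluster A (illustrative values)
recovery.mode: zookeeper
recovery.zookeeper.quorum: zkhost1:2181,zkhost2:2181
recovery.zookeeper.path.root: /flink-cluster-a

# a second HA cluster would use its own root, e.g.:
# recovery.zookeeper.path.root: /flink-cluster-b
```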

Re: Specially introduced Flink to chinese users in CNCC(China National Computer Congress)

2015-11-18 Thread Welly Tambunan
Agreed, and the stateful streaming operator instances in Flink look natural compared to Apache Spark. On Thu, Nov 19, 2015 at 10:54 AM, Liang Chen wrote: > Two aspects are attracting them: > 1. Flink is using Java; it is easy for most of them to start with Flink, and it is > easier to maintain in compar

Re: Specially introduced Flink to chinese users in CNCC(China National Computer Congress)

2015-11-18 Thread Liang Chen
Two aspects are attracting them: 1. Flink is using Java; it is easy for most of them to start with Flink, and it is easier to maintain in comparison to Storm (as Clojure is difficult to maintain, and few people know it). 2. Users really want a unified system supporting streaming and batch processing.

Re: Flink execution time benchmark.

2015-11-18 Thread Saleh
Hi rmetzger0, Thanks for the response. I didn't know that I had to register before I could receive responses to my posts. Now I am registered, but the problem is not resolved yet. I know it might not be intuitive to get the execution time from a long-running streaming job, but it is possible to get to

Compiler Exception

2015-11-18 Thread Truong Duc Kien
Hi, I'm hitting a CompilerException with some of my data sets, but not all of them. Exception in thread "main" org.apache.flink.optimizer.CompilerException: No plan meeting the requirements could be created @ Bulk Iteration (Bulk Iteration) (1:null). Most likely reason: Too restrictive plan hints.

Re: Fold vs Reduce in DataStream API

2015-11-18 Thread Kostas Tzoumas
Granted, both are presented with the same example in the docs. They are modeled after reduce and fold in functional programming. Perhaps we should have some more enlightening examples. On Wed, Nov 18, 2015 at 6:39 PM, Fabian Hueske wrote: > Hi Ron, > > Have you checked: > https://ci.apache.org/

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Flavio Pompermaier
It was long ago... but if I remember correctly they were about 50k. On 18 Nov 2015 19:22, "Stephan Ewen" wrote: > Okay, let me take a step back and make sure I understand this right... > > With many small files it takes longer to start the job, correct? How much > time did it actually take and how m

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Stephan Ewen
Okay, let me take a step back and make sure I understand this right... With many small files it takes longer to start the job, correct? How much time did it actually take and how many files did you have? On Wed, Nov 18, 2015 at 7:18 PM, Flavio Pompermaier wrote: > in my test I was using the lo

Re: FlinkKafkaConsumer and multiple topics

2015-11-18 Thread Stephan Ewen
The new KafkaConsumer for Kafka 0.9 should be able to handle this, as the Kafka client code itself has support for this then. For 0.8.x, we would need to implement support for recovery inside the consumer ourselves, which is why we decided to initially let the Job Recovery take care of that. If th

Re: Published test artifacts for flink streaming

2015-11-18 Thread Nick Dimiduk
Please see the above gist: my test makes no assertions until after the env.execute() call. Adding setParallelism(1) to my sink appears to stabilize my test. Indeed, very helpful. Thanks a lot! -n On Wed, Nov 18, 2015 at 9:15 AM, Stephan Ewen wrote: > Okay, I think I misunderstood your problem.

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Flavio Pompermaier
In my test I was using the local fs (ext4). On 18 Nov 2015 19:17, "Stephan Ewen" wrote: > The JobManager does not read all files, but it has to query the HDFS for > all file metadata (size, blocks, block locations), which can take a bit. > There is a separate call to the HDFS Namenode for each fil

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Stephan Ewen
The JobManager does not read all files, but it has to query the HDFS for all file metadata (size, blocks, block locations), which can take a bit. There is a separate call to the HDFS Namenode for each file. The more files, the more metadata has to be collected. On Wed, Nov 18, 2015 at 7:15 PM, Fl

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Flavio Pompermaier
So why does it take so long to start the job? Because in any case the job manager has to read all the lines of the input files before generating the splits? On 18 Nov 2015 17:52, "Stephan Ewen" wrote: > Late answer, sorry: > > The splits are created in the JobManager, so the job submission should not

Re: Fold vs Reduce in DataStream API

2015-11-18 Thread Fabian Hueske
Hi Ron, Have you checked: https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#transformations ? Fold is like reduce, except that you define a start element (of a different type than the input type) and the result type is the type of the initial value. In reduce,
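
The distinction Fabian describes can be sketched in plain Java (this is the fold/reduce semantics only, not the Flink API itself): reduce combines elements of the same type, while fold starts from an initial value whose type may differ from the element type and determines the result type.

```java
import java.util.List;

public class FoldVsReduce {
    // reduce: the result type equals the element type (here int)
    static int reduce(List<Integer> elements) {
        int result = elements.get(0);
        for (int i = 1; i < elements.size(); i++) {
            result = result + elements.get(i);
        }
        return result;
    }

    // fold: starts from an initial value whose type (String) differs
    // from the element type; the result has the initial value's type
    static String fold(String initial, List<Integer> elements) {
        String result = initial;
        for (int e : elements) {
            result = result + "-" + e;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4);
        System.out.println(reduce(input));          // 10
        System.out.println(fold("start", input));   // start-1-2-3-4
    }
}
```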

RE: YARN High Availability

2015-11-18 Thread Gwenhael Pasquiers
Nevermind, looking at the logs I saw that it was having issues trying to connect to ZK. To make it short, it had the wrong port. It is now starting. Tomorrow I’ll try to kill some JobManagers *evil*. Another question: if I have multiple HA Flink jobs, are there some points to check in order to

Fold vs Reduce in DataStream API

2015-11-18 Thread Ron Crocker
Is there a succinct description of the distinction between these transforms? Ron — Ron Crocker Principal Engineer & Architect New Relic rcroc...@newrelic.com M: +1 630 363 8835

Re: Published test artifacts for flink streaming

2015-11-18 Thread Stephan Ewen
Okay, I think I misunderstood your problem. Usually you can simply execute tests one after another by waiting until "env.execute()" returns. The streaming jobs terminate by themselves once the sources reach end of stream (finite streams are supported that way) but make sure all records flow throug
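
The pattern Stephan describes — let the finite stream run to completion, then assert — can be sketched outside the Flink API with a collecting sink. The names below are illustrative stand-ins; a real test would use a Flink SinkFunction writing to a static, synchronized collection and assert only after env.execute() returns.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CollectingSinkTest {
    // stands in for a SinkFunction that appends every record to a shared list
    static final List<Integer> SINK = Collections.synchronizedList(new ArrayList<>());

    // stands in for env.execute(): processes a finite source and only
    // returns once every record has reached the sink
    static void execute(List<Integer> source) {
        for (int record : source) {
            SINK.add(record * 2); // some transformation
        }
    }

    public static void main(String[] args) {
        execute(List.of(1, 2, 3));
        // assertions only AFTER execute() has returned
        System.out.println(SINK.size()); // 3
    }
}
```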

Re: YARN High Availability

2015-11-18 Thread Till Rohrmann
Hi Gwenhaël, do you have access to the yarn logs? Cheers, Till On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers < gwenhael.pasqui...@ericsson.com> wrote: > Hello, > > > > We’re trying to set up high availability using an existing zookeeper > quorum already running in our Cloudera cluster. >

YARN High Availability

2015-11-18 Thread Gwenhael Pasquiers
Hello, We're trying to set up high availability using an existing zookeeper quorum already running in our Cloudera cluster. So, as per the doc we've changed the max attempt in yarn's config as well as the flink.yaml. recovery.mode: zookeeper recovery.zookeeper.quorum: host1:3181,host2:3181,hos

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Stephan Ewen
Late answer, sorry: The splits are created in the JobManager, so the job submission should not be affected by that. The assignment of splits to workers is very fast, so many splits with small data are not very different from few splits with large data. Lines are never materialized and the operato

Re: Reading null value from datasets

2015-11-18 Thread Stephan Ewen
Hi Guido! If you use Scala, I would use an Option to represent nullable fields. That is a very explicit solution that marks which fields can be null, and also forces the program to handle this carefully. We are looking to add support for Java 8's Optional type as well for exactly that purpose. G
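
Since the thread mentions Java 8's Optional as the analogue of Scala's Option, here is a minimal sketch of the idea (the field and class names are illustrative, not from the original thread):

```java
import java.util.Optional;

public class NullableField {
    // a record type with a field that may legitimately be absent
    static class Person {
        final String name;
        final Optional<Integer> age; // explicit: age can be missing

        Person(String name, Optional<Integer> age) {
            this.name = name;
            this.age = age;
        }
    }

    static String describe(Person p) {
        // the caller is forced to handle the missing case explicitly
        return p.name + ": " + p.age.map(String::valueOf).orElse("unknown");
    }

    public static void main(String[] args) {
        System.out.println(describe(new Person("Guido", Optional.of(42)))); // Guido: 42
        System.out.println(describe(new Person("Anon", Optional.empty()))); // Anon: unknown
    }
}
```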

Re: Specially introduced Flink to chinese users in CNCC(China National Computer Congress)

2015-11-18 Thread Stephan Ewen
Thank you indeed for presenting there. It looks like a very large audience! Greetings, Stephan On Mon, Oct 26, 2015 at 11:24 AM, Maximilian Michels wrote: > Hi Liang, > > We greatly appreciate you introduced Flink to the Chinese users at CNCC! > We would love to hear how people like Flink. >

Re: Does 'DataStream.writeAsCsv' suppose to work like this?

2015-11-18 Thread Maximilian Michels
Yes, that does make sense! Thank you for explaining. Have you made the change yet? I couldn't find it on the master. On Wed, Nov 18, 2015 at 5:16 PM, Stephan Ewen wrote: > That makes sense... > > On Mon, Oct 26, 2015 at 12:31 PM, Márton Balassi > wrote: >> >> Hey Max, >> >> The solution I am pro

Re: Does 'DataStream.writeAsCsv' suppose to work like this?

2015-11-18 Thread Stephan Ewen
That makes sense... On Mon, Oct 26, 2015 at 12:31 PM, Márton Balassi wrote: > Hey Max, > > The solution I am proposing is not flushing on every record, but it makes > sure to forward the flushing from the sinkfunction to the outputformat > whenever it is triggered. Practically this means that th

Re: Published test artifacts for flink streaming

2015-11-18 Thread Nick Dimiduk
Sorry Stephan, but I don't follow how global order applies in my case. I'm merely checking the size of the sink results. I assumed all tuples from a given test invocation have sunk before the next test begins, which is clearly not the case. Is there a way I can place a barrier in my tests to ensure o

RE: MaxPermSize on yarn

2015-11-18 Thread Gwenhael Pasquiers
The option was accepted using the yaml file and it looks like it solved our issue. Thanks again. From: Robert Metzger [mailto:rmetz...@apache.org] Sent: mardi 17 novembre 2015 12:04 To: user@flink.apache.org Subject: Re: MaxPermSize on yarn You can also put the configuration option into the fl

Re: Checkpoints in batch processing & JDBC Output Format

2015-11-18 Thread Stephan Ewen
Hi! If you go with the Batch API, then any failed task (like a sink trying to insert into the database) will be completely re-executed. That makes sure no data is lost in any way, no extra effort needed. It may insert a lot of duplicates, though, if the task is re-started after half the data was
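
A common way to make such re-execution safe is a keyed idempotent write, so that replayed inserts overwrite rather than duplicate (e.g. an upsert against a primary key). The sketch below is a plain-Java stand-in for that idea, not Flink's JDBC output format itself:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentSink {
    // stands in for a table with a primary key: writing the same key
    // twice overwrites instead of creating a duplicate row
    static final Map<String, String> TABLE = new HashMap<>();

    static void upsert(String key, String value) {
        TABLE.put(key, value);
    }

    public static void main(String[] args) {
        // first (partially failed) attempt
        upsert("row-1", "a");
        upsert("row-2", "b");
        // full re-execution replays everything
        upsert("row-1", "a");
        upsert("row-2", "b");
        upsert("row-3", "c");
        System.out.println(TABLE.size()); // 3, no duplicates
    }
}
```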

Re: Published test artifacts for flink streaming

2015-11-18 Thread Stephan Ewen
There is no global order in parallel streams, it is something that applications need to work with. We are thinking about adding operations to introduce event-time order (at the cost of some delay), but that is only plans at this point. What I do in my tests is run the test streams in parallel, bu

Re: finite subset of an infinite data stream

2015-11-18 Thread Aljoscha Krettek
Hi, I wrote a little example that could be what you are looking for: https://github.com/dataArtisans/query-window-example It basically implements a window operator with a modifiable window size that also allows querying the current accumulated window contents using a second input stream. There

Re: Session Based Windows

2015-11-18 Thread Konstantin Knauf
Hi Aljoscha, thanks, that's what I thought. Just wanted to verify that keyBy + SessionWindow() works with intermingled events. Cheers, Konstantin On 18.11.2015 11:14, Aljoscha Krettek wrote: > Hi Konstantin, > you are right, if the stream is keyed by the session-id then it works. > > I was ref

Re: Creating a representative streaming workload

2015-11-18 Thread Robert Metzger
Hey Vasia, I think a very common workload would be an event stream from web servers of an online shop. Usually, these shops have multiple servers, so events arrive out of order. I think there are plenty of different use cases that you can build around that data: - Users perform different actions t
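
Robert's shop-event idea — multiple servers producing events that arrive out of order — can be sketched with a small synthetic generator. All names and values below are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShopEventGenerator {
    static class Event {
        final long timestamp;
        final String userAction;
        Event(long timestamp, String userAction) {
            this.timestamp = timestamp;
            this.userAction = userAction;
        }
    }

    // each "server" produces events with jittered timestamps; the merged
    // stream is then shuffled, so it is not globally ordered by timestamp
    static List<Event> generate(int perServer, int servers, long seed) {
        Random rnd = new Random(seed);
        List<Event> merged = new ArrayList<>();
        for (int s = 0; s < servers; s++) {
            for (int i = 0; i < perServer; i++) {
                merged.add(new Event(i * 1000L + rnd.nextInt(500), "view"));
            }
        }
        Collections.shuffle(merged, rnd); // simulate out-of-order arrival
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(generate(10, 3, 42).size()); // 30
    }
}
```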

Re: Session Based Windows

2015-11-18 Thread Aljoscha Krettek
Hi Konstantin, you are right: if the stream is keyed by the session-id then it works. I was referring to the case where you have, for example, some interactions with timestamps and you want to derive the sessions from this. In that case, it can happen that events that should belong to one session
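
Deriving sessions from raw timestamps, as described above, is usually done with a gap threshold: sorted events belong to the same session while consecutive timestamps are closer than the gap. A minimal sketch (the gap value and names are illustrative, not Flink's window API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SessionGaps {
    // splits timestamps into sessions whenever the gap between two
    // consecutive (sorted) events exceeds gapMillis
    static List<List<Long>> sessionize(List<Long> timestamps, long gapMillis) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sorted) {
            if (!current.isEmpty() && t - current.get(current.size() - 1) > gapMillis) {
                sessions.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) {
            sessions.add(current);
        }
        return sessions;
    }

    public static void main(String[] args) {
        // two bursts separated by a large gap -> two sessions
        List<Long> ts = List.of(0L, 100L, 200L, 10_000L, 10_050L);
        System.out.println(sessionize(ts, 1000L).size()); // 2
    }
}
```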

Re: Session Based Windows

2015-11-18 Thread Vladimir Stoyak
We were also trying to address session windowing but took a slightly different approach as to what window we place the event into. We did not want the "triggering event" to be purged as part of the window it triggered, but instead to create a new window for it and have the old window fire and pu