Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
Hi Dominique, can you check if the versions of the remotely running job manager & task managers are the same as the Flink version that is used to submit the job? The version and commit hash are logged at the top of the JM and TM log files. Right now, the local client optimizes the job, chooses th

Re: Dataset filter improvement

2016-02-09 Thread Flavio Pompermaier
Any help on this? On 9 Feb 2016 18:03, "Flavio Pompermaier" wrote: > Hi to all, > > in my program I have a Dataset that generated different types of object > wrt the incoming element. > Thus it's like a Map. > In order to type the different generated datasets I do something: > > Dataset start =..

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Dominique Rondé
Hi, your guess is correct. I use java all the time... Here is the complete stacktrace: Exception in thread "main" org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed. at org.apache.flink.client.program.Client.runBlocking(Client.

Re: Flink 1.0-SNAPSHOT scala 2.11 in S3 has scala 2.10

2016-02-09 Thread Chiwan Park
Hi David, I just downloaded the "flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz" but there is no jar compiled with Scala 2.10. Could you check again? Regards, Chiwan Park > On Feb 10, 2016, at 2:59 AM, David Kim wrote: > > Hello, > > I noticed that the flink binary for scala 2.11 located at > h

Re: Simple Flink - Kafka Test

2016-02-09 Thread Chiwan Park
The documentation I sent is for Flink 1.0. In Flink 0.10.x, there is no suffix of dependencies for Scala 2.10 (e.g. flink-streaming-java). But there is a suffix of dependencies for Scala 2.11 (e.g. flink-streaming-java_2.11). Regards, Chiwan Park > On Feb 10, 2016, at 1:46 PM, Chiwan Park wro
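For Flink 0.10.x with Scala 2.11, the artifact-suffix convention Chiwan describes would look roughly like this in a pom.xml (a sketch; version 0.10.1 is assumed from the thread):

```xml
<!-- Scala 2.11 build: note the _2.11 suffix on the artifactId -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java_2.11</artifactId>
  <version>0.10.1</version>
</dependency>
<!-- For Scala 2.10 in 0.10.x the artifact carries no suffix: flink-streaming-java -->
```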

Re: Simple Flink - Kafka Test

2016-02-09 Thread Chiwan Park
Hi shotte, The exception is caused by Scala version mismatch. If you want to use Scala 2.11, you have to set Flink dependencies compiled for Scala 2.11. We have a documentation about this in wiki [1]. I hope this helps. [1]: https://cwiki.apache.org/confluence/display/FLINK/Maven+artifact+nam

Re: Simple Flink - Kafka Test

2016-02-09 Thread shotte
Do I need to go to Flink 1.0 or downgrade to Kafka 0.8? -- View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Simple-Flink-Kafka-Test-tp4828p4829.html Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Simple Flink - Kafka Test

2016-02-09 Thread shotte
Hi, I am new to Flink and Kafka. I am trying to read a Kafka topic from Flink and send it back to another Kafka topic. Here is my setup: Flink 0.10.1, Kafka 0.9, all on a single node. I successfully wrote a Java program that sends messages to Kafka (topic = demo), and I have a consumer (in a shell)

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
Hi, glad you could resolve the POJO issue, but the new error doesn't look right. The CO_GROUP_RAW strategy should only be used for programs that are implemented against the Python DataSet API. I guess that's not the case since all code snippets were Java so far. Can you post the full stacktrace of

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Dominique Rondé
Hi all, I finally figured out that there is a getter for a boolean field which may be the source of the trouble. It seems that getBooleanField (as we use it) is not the best choice. Now the plan is executed with another error code. :( Caused by: java.lang.Exception: Unsupported driver strate

Re: OutputFormat vs SinkFunction

2016-02-09 Thread Nick Dimiduk
I think managing a lifecycle around the existing MR OutputFormat's makes a lot of sense for the streaming environment. Having them unified in the Flink Streaming API will make users' lives much better, and keeps the streaming world open to the large existing ecosystem. On Tue, Feb 9, 2016 at 6:13

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Till Rohrmann
I tested the TypeExtractor with your SourceA and SourceB types (adding proper setters and getters) and it correctly returned a PojoType. Thus, I would suspect that you haven’t specified the proper setters and getters in your implementation. Cheers, Till On Tue, Feb 9, 2016 at 2:46 PM, Dominique

Flink 1.0-SNAPSHOT scala 2.11 in S3 has scala 2.10

2016-02-09 Thread David Kim
Hello, I noticed that the flink binary for Scala 2.11 located at http://stratosphere-bin.s3.amazonaws.com/flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz contains the Scala 2.10 flavor. If you open the lib folder, the name of the jar in lib is flink-dist_2.10-1.0-SNAPSHOT.jar. Could this be an error in

Re: How to convert List to flink DataSet

2016-02-09 Thread Stefano Baghino
Assuming your ExecutionEnvironment is named `env`, simply call: DataSet fElements = env.fromCollection(finalElements); Does this help? On Tue, Feb 9, 2016 at 6:06 PM, subash basnet wrote: > Hello all, > > I have performed a modification in KMeans code to detect outliers. I have > printed the

How to convert List to flink DataSet

2016-02-09 Thread subash basnet
Hello all, I have performed a modification in KMeans code to detect outliers. I have printed the output in the console but I am not able to write it to the file using the given 'writeAsCsv' method. The problem is I generate a list of tuples. My List is: List finalElements = new ArrayList(); Follow

Dataset filter improvement

2016-02-09 Thread Flavio Pompermaier
Hi to all, in my program I have a Dataset that generates different types of objects depending on the incoming element. Thus it's like a Map. In order to type the different generated datasets I do something like: Dataset start =... Dataset ds1 = start.filter().map(..); Dataset ds2 = start.filter().map(..); Data
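The pattern described here, splitting one dataset into several differently typed ones via filter followed by map, can be sketched with plain Java collections standing in for the DataSet API (a generic sketch, not Flavio's actual code; the element types are hypothetical):

```java
import java.util.List;
import java.util.stream.Collectors;

public class FilterSplitSketch {
    // Mirrors start.filter(...).map(...) producing a dataset of one type.
    static List<Integer> stringLengths(List<Object> start) {
        return start.stream()
                .filter(o -> o instanceof String)   // keep one kind of element
                .map(o -> ((String) o).length())    // map it to the target type
                .collect(Collectors.toList());
    }

    // A second filter+map chain over the same source, yielding another type.
    static List<Integer> doubledInts(List<Object> start) {
        return start.stream()
                .filter(o -> o instanceof Integer)
                .map(o -> ((Integer) o) * 2)
                .collect(Collectors.toList());
    }
}
```

Each chain reads the shared source independently, which is also how the DataSet version behaves: `start` is consumed once per filter+map pipeline.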

Re: Kafka partition alignment for event time

2016-02-09 Thread shikhar
Yes, that approach seems perfect Stephan, thanks for creating the JIRA! It is not only when resetting to smallest; I have observed uneven progress on partitions skewing the watermark any time the source is not caught up to the head of each partition it is handling, like when stopping for a few minutes

Re: Kafka partition alignment for event time

2016-02-09 Thread Stephan Ewen
Thanks for filling us in. If the problem comes from the fact that the difference between partitions becomes high sometimes (when resetting to the smallest offset), then this could probably be solved similarly as suggested here ( https://issues.apache.org/jira/browse/FLINK-3375) by running a waterm
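The idea behind the linked JIRA (FLINK-3375) is to track a watermark per Kafka partition inside the source and emit the minimum across partitions, so one lagging partition holds the overall event-time clock back instead of having its records dropped as late. A minimal sketch of that combination step (an illustration, not Flink's actual implementation):

```java
public class PartitionWatermarks {
    // Overall source watermark = minimum of the per-partition watermarks.
    // A partition that lags (e.g. after resetting to the smallest offset)
    // keeps the combined watermark low until it catches up.
    static long combinedWatermark(long[] perPartitionWatermarks) {
        long min = Long.MAX_VALUE;
        for (long w : perPartitionWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }
}
```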

Re: Kafka partition alignment for event time

2016-02-09 Thread shikhar
Hi Fabian, Sorry, I should have been clearer. What I meant (or now know!) by duplicate emits is that since the watermark is progressing more rapidly than the state of the offsets on some partitions due to the source multiplexing more than 1 partition, when messages from the lagging partitions are

Re: Kafka partition alignment for event time

2016-02-09 Thread shikhar
I am assigning timestamps using a threshold-based extractor -- the static delta from last timestamp is probably sufficient and the PriorityQueue for allowing outliers not necessary, that is something I added while figuring out what was going
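A threshold-based extractor of the kind mentioned can be sketched as follows: the watermark trails the maximum timestamp seen so far by a fixed delta, tolerating bounded out-of-orderness (a sketch under that assumption, not the poster's actual code):

```java
public class ThresholdWatermarkTracker {
    private final long maxDelta;                 // allowed out-of-orderness
    private long maxTimestamp = Long.MIN_VALUE;  // largest timestamp seen so far

    public ThresholdWatermarkTracker(long maxDelta) {
        this.maxDelta = maxDelta;
    }

    // Record an element's timestamp; late elements never move the max backwards.
    public void onEvent(long timestamp) {
        maxTimestamp = Math.max(maxTimestamp, timestamp);
    }

    // The watermark trails the maximum seen timestamp by the fixed delta.
    public long currentWatermark() {
        return maxTimestamp - maxDelta;
    }
}
```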

Re: OutputFormat vs SinkFunction

2016-02-09 Thread Stephan Ewen
Most of the problems we had with OutputFormats is that many were implemented in a batchy way: - They buffer data and write large chunks at some points - They need the "close()" call before any consistent result is guaranteed That is mostly the case for FileOutputFormats, but not exclusively (s

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Dominique Rondé
Here we go! ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment("xxx.xxx.xxx.xxx", 53408, "flink-job.jar"); DataSource datasourceA = env.readTextFile("hdfs://dev//sourceA/"); DataSource datasourceB = env.readTextFile("hdfs://dev//sourceB/"); DataSet sourceA = datasou

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Till Rohrmann
Could you post the complete example code (the Flink job including the type definitions)? For example, if the data sets are of type DataSet, then it will be treated as a GenericType. Judging from your pseudo code, it looks fine at first glance. Cheers, Till On Tue, Feb 9, 2016 at 2:25 PM, Domini

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
String is perfectly fine as key. Looks like SourceA / SourceB are not correctly identified as Pojos. 2016-02-09 14:25 GMT+01:00 Dominique Rondé : > Sorry, i was out for lunch. Maybe the problem is that sessionID is a > String? > > public abstract class Parent{ > private Date eventDate; > priv

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Dominique Rondé
Sorry, I was out for lunch. Maybe the problem is that sessionID is a String? public abstract class Parent { private Date eventDate; private EventType eventType; private String sessionId; public Parent() { } // GETTER & SETTER } public class SourceA extends Parent { private Boolean outbound
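The POJO conditions discussed in this thread (public class, public no-argument constructor, private fields exposed via public getter/setter pairs) can be checked reflectively. A sketch with a hypothetical class modeled on the Parent/SourceA snippet above:

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

public class PojoCheck {
    // Hypothetical POJO modeled on the classes in the thread.
    public static class SourceA {
        private String sessionId;
        public SourceA() { }  // public no-arg constructor, required by Flink
        public String getSessionId() { return sessionId; }
        public void setSessionId(String sessionId) { this.sessionId = sessionId; }
    }

    // Checks the conditions for one private field: a public class, a public
    // no-arg constructor, and a public getter/setter pair for the field.
    static boolean hasPojoAccessors(Class<?> clazz, String field) {
        String suffix = Character.toUpperCase(field.charAt(0)) + field.substring(1);
        try {
            clazz.getConstructor();                           // public no-arg ctor
            Method getter = clazz.getMethod("get" + suffix);  // public getter
            clazz.getMethod("set" + suffix, getter.getReturnType()); // public setter
            return Modifier.isPublic(clazz.getModifiers());
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}
```

This is only an approximation of what Flink's TypeExtractor verifies, but it catches the missing-setter problem discussed in this thread.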

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Chiwan Park
I wrote a sample inherited POJO example [1]. The example works with Flink 0.10.1 and 1.0-SNAPSHOT. [1]: https://gist.github.com/chiwanpark/0389ce946e4fff58d611 Regards, Chiwan Park > On Feb 9, 2016, at 8:07 PM, Fabian Hueske wrote: > > What is the type of sessionId? > It must be a key type i

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
What is the type of sessionId? It must be a key type in order to be used as key. If it is a generic class, it must implement Comparable to be used as key. 2016-02-09 11:53 GMT+01:00 Dominique Rondé : > The fields in SourceA and SourceB are private but have public getters and > setters. The classe

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Till Rohrmann
Could you share the code for your types SourceA and SourceB? It seems as if Flink does not recognize them to be POJOs because it assigned them the GenericType type. Either there is something wrong with the type extractor or your implementation does not fulfil the requirements for POJOs, as indicate

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Dominique Rondé
The fields in SourceA and SourceB are private but have public getters and setters. The classes provide an empty and public constructor. On 09.02.2016 11:47, "Chiwan Park" wrote: > Oh, the fields in SourceA have public getters. Do the fields in SourceA > have public setters? SourceA needs public

Re: Kafka partition alignment for event time

2016-02-09 Thread Stephan Ewen
Hi Shikhar! What you are seeing is that some streams (here the different Kafka partitions in one source) get merged in the source task. That happens before watermarks are generated. In such a case, records are out-of-order when they arrive at the timestamp-extractor/watermark generator, and the wat

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Chiwan Park
Oh, the fields in SourceA have public getters. Do the fields in SourceA have public setters? SourceA needs public setters for private fields. Regards, Chiwan Park > On Feb 9, 2016, at 7:45 PM, Chiwan Park wrote: > > Hi Dominique, > > It seems that `SourceA` is not treated as a POJO. Are all field

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Chiwan Park
Hi Dominique, It seems that `SourceA` is not treated as a POJO. Are all fields in SourceA public? There are some requirements for POJO classes [1]. [1]: https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#pojos Regards, Chiwan Park > On Feb 9, 2016, at 7:42 PM

Join two Datasets --> InvalidProgramException

2016-02-09 Thread Dominique Rondé
Hi folks, I am trying to join two datasets containing some POJOs. Each POJO inherits a field "sessionId" from the parent class. The field is private but has a public getter. The join is like this: DataSet> joinedDataSet = sourceA.join(SourceB).where("sessionId").equalTo("sessionId"); But the res

Re: Quick question about enableObjectReuse()

2016-02-09 Thread Stephan Ewen
The only thing you need to be aware of (in the batch API) is that you cannot simply gather elements in a list any more. The following does not work when enabling object reuse: class MyReducer implements GroupReduceFunction { public void reduceGroup(Iterable values, Collector out) {
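The pitfall Stephan describes, buffering references to an object the runtime reuses, can be reproduced without Flink. In this sketch a single reused instance stands in for the records handed to a reduceGroup function when object reuse is enabled (the class names are illustrative, not Flink API):

```java
import java.util.ArrayList;
import java.util.List;

public class ObjectReuseDemo {
    static class Event {
        int value;
    }

    // Simulates a runtime that hands the user function the same instance for
    // every record, as Flink's batch API does under enableObjectReuse().
    static List<Integer> collect(int[] values, boolean defensiveCopy) {
        List<Event> buffer = new ArrayList<>();
        Event reused = new Event();            // the single reused instance
        for (int v : values) {
            reused.value = v;                  // runtime overwrites it in place
            if (defensiveCopy) {
                Event copy = new Event();      // copy before buffering: safe
                copy.value = reused.value;
                buffer.add(copy);
            } else {
                buffer.add(reused);            // buffering the reference: broken
            }
        }
        List<Integer> out = new ArrayList<>();
        for (Event e : buffer) {
            out.add(e.value);
        }
        return out;
    }
}
```

Without defensive copies, every buffered element ends up reflecting the last record written into the reused instance, which is exactly why gathering elements in a list needs explicit copying when object reuse is on.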

Re: pom file scala plugin error [flink-examples-batch project]

2016-02-09 Thread Stephan Ewen
Hi! The Eclipse Setup is quite tricky, because Eclipse's Maven and Scala support is not very good. Here is how I got it to run: https://ci.apache.org/projects/flink/flink-docs-release-0.10/internals/ide_setup.html#eclipse-scala-ide-303 Stephan On Tue, Feb 9, 2016 at 10:20 AM, Robert Metzger w

Re: Quick question about enableObjectReuse()

2016-02-09 Thread Till Rohrmann
Yes, you're right Arnaud. Cheers, Till On Tue, Feb 9, 2016 at 10:42 AM, LINZ, Arnaud wrote: > Hi, > > > > I just want to be sure : when I set enableObjectReuse, I don’t need to > create copies of objects that I get as input and return as output but which > I don’t keep inside my user function ?

Quick question about enableObjectReuse()

2016-02-09 Thread LINZ, Arnaud
Hi, I just want to be sure: when I set enableObjectReuse, I don’t need to create copies of objects that I get as input and return as output, but which I don’t keep inside my user function? For instance, if I want to join Tuple2(A,B) with C into Tuple3(A,B,C) using a Join function, I can write

Re: Kafka partition alignment for event time

2016-02-09 Thread Fabian Hueske
Hi, where did you observe the duplicates, within Flink or in Kafka? Please be aware that the Flink Kafka Producer does not provide exactly-once consistency. This is not easily possible because Kafka does not support transactional writes yet. Flink's exactly-once guarantees are only valid within t

Re: OutputFormat vs SinkFunction

2016-02-09 Thread Maximilian Michels
I think you have a point, Nick. OutputFormats on their own have the same fault-tolerance semantics as SinkFunctions. What kind of failure semantics they guarantee depends on the actual implementation. For instance, the RMQSource has exactly-once semantics but the RMQSink currently does not. If you ca

Re: Kafka partition alignment for event time

2016-02-09 Thread Aljoscha Krettek
Hi, in general it should not be a problem if one parallel instance of a source is responsible for several Kafka partitions. It can become a problem if the timestamps in the different partitions differ by a lot and the watermark assignment logic is not able to handle this. How are you assigning th

Re: pom file scala plugin error [flink-examples-batch project]

2016-02-09 Thread Robert Metzger
Hi Subash, I think the two errors are unrelated. Maven is failing because of the checkstyle plugin. It checks whether the code follows our coding guidelines. If you are experienced with IntelliJ, I would suggest using that IDE. Most Flink committers are using it because it's difficult to get Eclipse s

Re: GC on TaskManagers stats

2016-02-09 Thread Robert Metzger
Hi Guido, sorry for the late reply. You were collecting the stats every second. AFAIK, Flink internally collects the stats with a frequency of 5 seconds, so you can either change your or Flink's polling interval (I think it's taskmanager.heartbeat-interval). Regarding the details on PS-Scave