RE: Kyro Intermittent Exception for Large Data

2016-02-18 Thread Ken Krugler
I've seen this type of error when using Kryo with a Cascading scheme I'd created. In my case it happened when serializing a large object graph, where some of the classes didn't have no-arg constructors. The general fix was to set an instantiator strategy for Kryo - see: https://github.com/Scal

Re: Flink HA

2016-02-18 Thread Ufuk Celebi
On Thu, Feb 18, 2016 at 6:59 PM, Thomas Lamirault wrote: > We are trying flink in HA mode. Great to hear! > We set in the flink yaml : > > state.backend: filesystem > > recovery.mode: zookeeper > recovery.zookeeper.quorum: > > recovery.zookeeper.path.root: > > recovery.zookeeper.storageDir: >

Flink HA

2016-02-18 Thread Thomas Lamirault
Hi ! We are trying flink in HA mode. Our application is a streaming application with windowing mechanism. We set in the flink yaml : state.backend: filesystem recovery.mode: zookeeper recovery.zookeeper.quorum: recovery.zookeeper.path.root: recovery.zookeeper.storageDir: recovery.back

Re: Finding the average temperature

2016-02-18 Thread Nirmalya Sengupta
Hello Aljoscha , You mentioned: '.. Yes, this is right if you temperatures don’t have any other field on which you could partition them. '. What I am failing to understand is that if temperatures are partitioned on some other field (in my use-case, I have one such: the temp_reading_timestamp), th

Re: streaming hdfs sub folders

2016-02-18 Thread Martin Neumann
I tried to implement your idea but I'm getting NullPointer exceptions from the AvroInputFormat any Idea what I'm doing wrong? See the code below: public static void main(String[] args) throws Exception { // set up the execution environment final StreamExecutionEnvironment env = StreamExec

Re: streaming hdfs sub folders

2016-02-18 Thread Martin Neumann
I guess I need to set the parallelism for the FlatMap to 1 to make sure I read one file at a time. The downside I see with this is that I will be not able to read in parallel from HDFS (and the files are Huge). I give it a try and see how much performance I loose. cheers Martin On Thu, Feb 18, 2

Re: Read every file in a directory at once

2016-02-18 Thread Flavio Pompermaier
My current solution is: List paths = new ArrayList(); File dir = new File(BASE_DIR); for (File f : dir.listFiles()) { paths.add(f.getName()); } DataSet mail = env.fromCollection(paths).map(new FileToString(BASE_DIR)). The FileToString does basically a map that return FileUtils.toString(new

Re: How to iterate over DataSet elements without converting it to List

2016-02-18 Thread Judit Fehér
Hi, if you want to iterate through a DataSet you can simply use the map function on the DataSets instead of for loops. In your example you have nested loops, instead of this you can join the two datasets and then perform the map function. It looks like you may want to implement a k-means algorithm

Read every file in a directory at once

2016-02-18 Thread Flavio Pompermaier
Hi to all, I want to apply a map function to every file in a folder. Is there an easy way (or an already existing InputFormat) to do that? Best, Flavio

Re: Kafka partition alignment for event time

2016-02-18 Thread Erdem Agaoglu
Thanks Stephan On Thu, Feb 18, 2016 at 3:00 PM, Stephan Ewen wrote: > You are right, the checkpoints should contain all offsets. > > I created a Ticket for this: > https://issues.apache.org/jira/browse/FLINK-3440 > > > > > On Thu, Feb 18, 2016 at 10:15 AM, agaoglu wrote: > >> Hi, >> >> On a rel

How to iterate over DataSet elements without converting it to List

2016-02-18 Thread subash basnet
Hello there, I have been stuck on how to iterate over the DataSet, perform operations and return a new modified DataSet similar to that of list operation as shown below. Eg: for (Centroid centroid : centroids.collect()) { for (Tuple2 element : clusteredPoints.collect()) { //perform nece

Re: Very old dependencies and solutions

2016-02-18 Thread Andrew Ge Wu
Thanks Stephan, problem solved. here is my configuration org.apache.maven.plugins maven-shade-plugin 2.4.3 package shade my.main.class

Re: Changing parallelism

2016-02-18 Thread Zach Cox
Hi Ufuk - thanks for the 2016 roadmap - glad to see changing parallelism is the first bullet :) Mesos support also sounds great, we're currently running job and task managers on Mesos statically via Marathon. Hi Stephan - thanks, that trick sounds pretty clever, I will try wrapping my head around

Re: Availability for the ElasticSearch 2 streaming connector

2016-02-18 Thread Zach Cox
Awesome, thanks Suneel. :D I made the changes to support our use case, which needed flatMap behavior (index 2 docs, or zero docs, per incoming element) instead of map, and we also need to make either IndexRequest or UpdateRequest depending on the element. -Zach On Thu, Feb 18, 2016 at 2:06 AM S

Re: streaming hdfs sub folders

2016-02-18 Thread Stephan Ewen
Martin, I think you can approximate this in an easy way like this: - On the client, you traverse your directories to collect all files that you need, collect all file paths in a list. - Then you have a source "env.fromElements(paths)". - Then you flatMap and in the FlatMap, run the Avro inp

Re: Changing parallelism

2016-02-18 Thread Stephan Ewen
Hi Zach! Yes, changing parallelism is pretty high up the priority list. The good news is that "scaling in" is the simpler part of changing the parallelism and we are pushing to get that in soon. Until then, there is only a pretty ugly trick that you can do right now to "rescale' the state: 1)

Re: Very old dependencies and solutions

2016-02-18 Thread Stephan Ewen
Hi! A lot of those dependencies are pulled in by Hadoop (for example the configuration / HTTP components). In 1.0-SNAPSHOT, the HTTP components dependency has been shaded away in Hadoop, so it should not bother you any more. One solution you can always do is to "shade" your dependencies in your

Re: Kafka partition alignment for event time

2016-02-18 Thread Stephan Ewen
You are right, the checkpoints should contain all offsets. I created a Ticket for this: https://issues.apache.org/jira/browse/FLINK-3440 On Thu, Feb 18, 2016 at 10:15 AM, agaoglu wrote: > Hi, > > On a related and a more exaggerated setup, our kafka-producer (flume) seems > to send data to a

Very old dependencies and solutions

2016-02-18 Thread Andrew Ge Wu
Hi guys, You probably have noticed. I found a lot of old dependencies (http component 3.1/apache configuration 1.6 etc..) in Flink and leads up errors to stuff like this: java.lang.NoClassDefFoundError: Could not initialize class org.apache.http.conn.ssl.SSLConnectionSocketFactory Is there an

Re: Finding the average temperature

2016-02-18 Thread Stephan Ewen
Combiners in streaming are a bit tricky, from their semantics: 1) Combiners always hold data back, through the preaggregation. That adds latency and also means the values are not in the actual windows immediately, where a trigger may expect them. 2) In batch, a combiner combines as long as there

Re: Problem with Kafka 0.9 Client

2016-02-18 Thread Robert Metzger
Hi Javier, sorry for the late response. In the Error Mapping of Kafka, it says that code 15 means: ConsumerCoordinatorNotAvailableCode. https://github.com/apache/kafka/blob/0.9.0/core/src/main/scala/kafka/common/ErrorMapping.scala How many brokers did you put into the list of bootstrap servers? C

Re: Finding the average temperature

2016-02-18 Thread Aljoscha Krettek
They would be awesome, but it’s not yet possible in Flink Streaming, I’m afraid. > On 18 Feb 2016, at 10:59, Stefano Baghino > wrote: > > I think combiners are pretty awesome for certain cases to minimize network > usage (the average use case seems to fit perfectly), maybe it would be > worth

Re: Finding the average temperature

2016-02-18 Thread Stefano Baghino
I think combiners are pretty awesome for certain cases to minimize network usage (the average use case seems to fit perfectly), maybe it would be worthwhile adding a detailed description of the approach to the docs? On Thu, Feb 18, 2016 at 10:47 AM, Aljoscha Krettek wrote: > @Nirmalya: Yes, this

Re: Finding the average temperature

2016-02-18 Thread Aljoscha Krettek
@Nirmalya: Yes, this is right if you temperatures don’t have any other field on which you could partition them. @Stefano: Under some circumstances it would be possible to use a a combiner (I’m using the name as Hadoop MapReduce would use it, here). When the assignment of elements to windows hap

Re: Kafka partition alignment for event time

2016-02-18 Thread agaoglu
Hi, On a related and a more exaggerated setup, our kafka-producer (flume) seems to send data to a single partition at a time and switches it every few minutes. So when i run my flink datastream program for the first time, it starts on the *largest* offsets and shows something like this: . Fetched

Re: Finding the average temperature

2016-02-18 Thread Stefano Baghino
Thanks, Aljosha, for the explanation. Isn't there a way to apply the concept of the combiner to a streaming process? On Thu, Feb 18, 2016 at 3:56 AM, Nirmalya Sengupta < sengupta.nirma...@gmail.com> wrote: > Hello Aljoscha > > Thanks very much for clarifying the role of Pre-Aggregation (rather

Re: Changing parallelism

2016-02-18 Thread Ufuk Celebi
Hey Zach! Sounds like a great use case. On Wed, Feb 17, 2016 at 3:16 PM, Zach Cox wrote: > However, the savepoint docs state that the job parallelism cannot be changed > over time [1]. Does this mean we need to use the same, fixed parallelism=n > during reprocessing and going forward? Are there

Re: Availability for the ElasticSearch 2 streaming connector

2016-02-18 Thread Suneel Marthi
Thanks Zach, I have a few minor changes too locally; I'll push a PR out tomorrow that has ur changes too. On Wed, Feb 17, 2016 at 5:13 PM, Zach Cox wrote: > I recently did exactly what Robert described: I copied the code from this > (closed) PR https://github.com/apache/flink/pull/1479, modified