Re: streaming join implementation

2016-04-13 Thread Balaji Rajagopalan
Let me give you a specific example: say stream1 event1 happened within your window 0-5 min with key1, and event2 on stream2 with key2 which could have matched with key1 happened at 5:01, outside the join window. So now you will have to correlate the event2 on stream2 with the event1 on stream1 which

Re: streaming join implementation

2016-04-13 Thread Henry Cai
Thanks Balaji. Do you mean you spill the non-matching records after 5 minutes into redis? Does flink give you control over which records are not matching in the current window such that you can copy them into a long-term storage? On Wed, Apr 13, 2016 at 11:20 PM, Balaji Rajagopalan < balaji.rajagopa..

Re: streaming join implementation

2016-04-13 Thread Balaji Rajagopalan
You can implement a join in flink (which is an inner join) with the below mentioned pseudo code. The below join is for a 5 minute interval; yes, there will be some corner cases when the data coming after 5 minutes will be missed out in the join window. I actually had solved this problem by storing some data i
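The tumbling-window inner join described above can be sketched without any Flink dependency. This is a minimal illustration only: the `Event` record, the bucketing by `timestamp / windowMillis`, and the output format are all assumptions, not the poster's actual pseudo code or Flink's join operator. Events landing in different 5-minute buckets (the corner case discussed in this thread) simply fail to match.

```java
import java.util.*;

// Illustrative sketch of a tumbling-window inner join between two streams.
// An event matches when keys are equal AND both timestamps fall into the
// same window bucket; a record arriving just after the window closes
// (e.g. at 5:01 for a 0-5 min window) lands in the next bucket and is missed.
class WindowJoinSketch {
    public record Event(long ts, String key, String value) {}

    public static List<String> join(List<Event> left, List<Event> right,
                                    long windowMillis) {
        // Index the right-hand stream by (key, window bucket).
        Map<String, List<Event>> rightIndex = new HashMap<>();
        for (Event e : right) {
            rightIndex.computeIfAbsent(e.key() + "@" + (e.ts() / windowMillis),
                    k -> new ArrayList<>()).add(e);
        }
        // Probe with the left-hand stream.
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            String bucket = l.key() + "@" + (l.ts() / windowMillis);
            for (Event r : rightIndex.getOrDefault(bucket, List.of())) {
                out.add(l.value() + "|" + r.value());
            }
        }
        return out;
    }
}
```

A production version would spill unmatched records to external storage (as discussed above with redis) instead of silently dropping them.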

streaming join implementation

2016-04-13 Thread Henry Cai
Hi, We are evaluating different streaming platforms. For a typical join between two streams: SELECT a.*, b.* FROM a, b ON a.id == b.id How does flink implement the join? The matching record from either stream can come late; we consider it a valid join as long as the event time for record a an

Re: Does Kafka connector leverage Kafka message keys?

2016-04-13 Thread Elias Levy
On Wed, Apr 13, 2016 at 2:10 AM, Stephan Ewen wrote: > If you want to use Flink's internal key/value state, however, you need to > let Flink re-partition the data by using "keyBy()". That is because Flink's > internal sharding of state (including the re-sharding to adjust parallelism > we are cur

Re: RocksDB Statebackend

2016-04-13 Thread Konstantin Knauf
Hi Aljoscha, thanks for your answers. I am currently not in the office, so I can not run any further analysis until Monday. Just some quick answers to your questions. We are using the partitioned state abstraction, most of the state should correspond to buffered events in windows. Parallelism is

Re: Sink partitioning

2016-04-13 Thread Konstantin Knauf
Hi, calling DataStream.partitionCustom() with the respective arguments before the sink should do the trick, I think. Cheers, Konstantin On 14.04.2016 01:22, neo21 zerro wrote: > Hello everybody, > > I have an elasticsearch sink in my flink topology. > My requirement is to write the data in a p
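The routing that `DataStream.partitionCustom(partitioner, keySelector)` performs before a sink can be sketched in plain Java. The interfaces below are simplified stand-ins for Flink's `Partitioner` and `KeySelector`, and the channel map stands in for the parallel sink subtasks; all names are illustrative assumptions.

```java
import java.util.*;
import java.util.function.*;

// Sketch of custom partitioning before a sink: every record is routed to
// a channel chosen by applying a partitioner to its extracted key, so all
// records with the same key reach the same (simulated) sink partition.
class CustomPartitionSketch {
    public static <T, K> Map<Integer, List<T>> route(
            List<T> records,
            Function<T, K> keySelector,
            BiFunction<K, Integer, Integer> partitioner,
            int numPartitions) {
        Map<Integer, List<T>> channels = new HashMap<>();
        for (T r : records) {
            int p = partitioner.apply(keySelector.apply(r), numPartitions);
            channels.computeIfAbsent(p, i -> new ArrayList<>()).add(r);
        }
        return channels;
    }
}
```

With a key selector that extracts the user id and a hash-modulo partitioner, all events for one user end up in the same channel, which mirrors the elasticsearch-sink requirement in this thread.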

Re: How to perform this join operation?

2016-04-13 Thread Maxim
You could simulate the Samza approach by having a RichFlatMapFunction over cogrouped streams that maintains the sliding window in its ListState. As I understand the drawback is that the list state is not maintained in the managed memory. I'm interested to hear about the right way to implement this.
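The buffered-sliding-window idea above can be sketched with an in-memory deque standing in for Flink's `ListState`: buffer the `(x, y)` tuples from the first stream, evict entries older than the window, and probe the buffer for each arriving `z`. The class, the 24-hour window, and the match rule (`x == z`) are illustrative assumptions taken from the question this message answers, not Flink API.

```java
import java.util.*;

// Sketch of a time-based sliding buffer kept per operator instance.
// The ArrayDeque stands in for ListState; entries arrive in timestamp
// order and are evicted once they fall outside the window.
class SlidingBufferSketch {
    public record Entry(long ts, int x, int y) {}

    private final long windowMillis;
    private final ArrayDeque<Entry> buffer = new ArrayDeque<>();

    public SlidingBufferSketch(long windowMillis) { this.windowMillis = windowMillis; }

    public void add(Entry e) { buffer.addLast(e); }

    // For an arriving z, first evict expired entries, then return the y
    // values of all buffered entries whose x equals z.
    public List<Integer> probe(int z, long now) {
        while (!buffer.isEmpty() && buffer.peekFirst().ts() < now - windowMillis) {
            buffer.removeFirst();
        }
        List<Integer> out = new ArrayList<>();
        for (Entry e : buffer) if (e.x() == z) out.add(e.y());
        return out;
    }
}
```

As the message notes, the drawback of the real ListState variant is that this buffer is not kept in Flink's managed memory.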

Sink partitioning

2016-04-13 Thread neo21 zerro
Hello everybody, I have an elasticsearch sink in my flink topology. My requirement is to write the data in a partitioned fashion to my Sink. For example I have a Tuple which contains a user id. I want to group all events by user id and partition all events for one particular user to the same Es

Missing metrics on Flink UI

2016-04-13 Thread neo21 zerro
Hello everybody, I have deployed the latest Flink Version 1.0.1 on Yarn 2.5.0-cdh5.3.0. When I push the WordCount example shipped with the Flink distribution, I can see metrics (bytes received) in the Flink Ui on the corresponding operator. However, I used the flink kafka connector and when I r

How to perform this join operation?

2016-04-13 Thread Elias Levy
I am wondering how you would implement the following function in Flink. The function takes as input two streams. One stream can be viewed as a tuple with two values *(x, y)*, the second stream is a stream of individual values *z*. The function keeps a time based window on the first input (say, 24 h

Re: asm IllegalArgumentException with 1.0.0

2016-04-13 Thread Stephan Ewen
Does this problem persist? (It might have been caused by maven caches with bad artifacts). The many transitive dependencies you see often come from the connectors - that is also why we do not put the connectors into the lib folder directly, so that these libraries are not always part of every Flin

Re: Limit buffer size for a job

2016-04-13 Thread Stephan Ewen
You can reduce Flink's internal network buffering by adjusting the total number of network buffers. See https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#configuring-the-network-buffers That should also make a difference. On Wed, Apr 13, 2016 at 3:57 PM, Andrew Ge Wu
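The tuning Stephan points at lives in `flink-conf.yaml`. A hedged sketch of the relevant keys for the 1.0 line is below; the values shown are illustrative, not recommendations, so check the linked configuration page for your version before changing them.

```yaml
# flink-conf.yaml (Flink 1.0.x) - illustrative values only.
# Fewer network buffers means less in-flight data is buffered
# between subtasks, reducing end-to-end latency under load.
taskmanager.network.numberOfBuffers: 2048

# Size of each buffer/memory segment in bytes (default 32 KiB).
taskmanager.memory.segment-size: 32768
```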

Flink performance pre-packaged vs. self-compiled

2016-04-13 Thread Robert Schmidtke
Hi everyone, I'm using Flink 0.10.2 for some benchmarks and had to add some small changes to Flink, which led me to compiling and running it myself. This is when I noticed a performance difference in the pre-packaged Flink version that I downloaded from the web ( http://archive.apache.org/dist/fli

Re: Limit buffer size for a job

2016-04-13 Thread Andrew Ge Wu
Thanks guys for the explanation, I will give it a try. After all buffers are filled, the back pressure did its job; it works so far so good, but I will definitely try to control the latency. Thanks again! Andrew > On 11 Apr 2016, at 18:19, Stephan Ewen wrote: > > Hi! > > Ufuk's su

Re: RocksDB Statebackend

2016-04-13 Thread Aljoscha Krettek
That's interesting to hear. If you want we can also collaborate on that one. Using the Flink managed memory for that purpose would require some changes to lower layers of Flink. On Wed, 13 Apr 2016 at 13:11 Shannon Carey wrote: > This is something that my team and I have discussed building, so i

Re: RocksDB Statebackend

2016-04-13 Thread Shannon Carey
This is something that my team and I have discussed building, so it's great to know that it's already on the radar. If we beat you to it, I'll definitely try to make it a contribution. Shannon From: Aljoscha Krettek mailto:aljos...@apache.org>> Date: Wednesday, April 13, 2016 at 1:46 PM To: mai

Re: Flink ML 1.0.0 - Saving and Loading Models to Score a Single Feature Vector

2016-04-13 Thread KirstiLaurila
Now I got this working in cloud (not locally, but it's ok) so thanks a lot. Next problem is how to then read these written files and add them to the ALS. I guess it is something like val als = ALS() als.factorsOption = Option(users,items) but I don't get how I could read in the data I hav

Re: threads, parallelism and task managers

2016-04-13 Thread Stephan Ewen
No problem ;-) On Wed, Apr 13, 2016 at 11:38 AM, Stefano Bortoli wrote: > Sounds you are damn right! thanks for the insight, dumb on us for not > checking this before. > > saluti, > Stefano > > 2016-04-13 11:05 GMT+02:00 Stephan Ewen : > >> Sounds actually not like a Flink issue. I would look in

Re: threads, parallelism and task managers

2016-04-13 Thread Stefano Bortoli
Sounds like you are damn right! Thanks for the insight, dumb on us for not checking this before. saluti, Stefano 2016-04-13 11:05 GMT+02:00 Stephan Ewen : > Sounds actually not like a Flink issue. I would look into the commons pool > docs. > Maybe they size their pools by default with the number of c

Re: threads, parallelism and task managers

2016-04-13 Thread Stephan Ewen
Sounds actually not like a Flink issue. I would look into the commons pool docs. Maybe they size their pools by default with the number of cores, so the pool has only 8 threads, and other requests are queued? On Wed, Apr 13, 2016 at 10:29 AM, Flavio Pompermaier wrote: > Any feedback about our JD

Re: Does Kafka connector leverage Kafka message keys?

2016-04-13 Thread Stephan Ewen
Hi! You can exploit that, yes. If you read data from Kafka in Flink, a Kafka partition is "sticky" to a Flink source subtask. If you have (kafka-source => mapFunction) for example, you can be sure that all values with the same key go through the same parallel mapFunction subtask. If you maintain a
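The "same key goes through the same parallel subtask" guarantee Stephan describes boils down to a deterministic key-to-subtask mapping. The hash-modulo scheme below is an illustrative assumption only (Flink's real implementation uses key groups and a different hash), but it shows the property being relied on.

```java
// Sketch of deterministic key sharding: the mapping from key to subtask
// depends only on the key and the parallelism, so every record with the
// same key is routed to the same parallel subtask.
class KeyShardSketch {
    public static int subtaskFor(String key, int parallelism) {
        // Math.floorMod keeps the index non-negative for any hashCode.
        return Math.floorMod(key.hashCode(), parallelism);
    }
}
```

Note that because this mapping generally differs from Kafka's own partitioner, a `keyBy()` after the Kafka source may still shuffle data between machines, which is the concern raised later in this thread.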

Re: threads, parallelism and task managers

2016-04-13 Thread Flavio Pompermaier
Any feedback about our JDBC InputFormat issue..? On Thu, Apr 7, 2016 at 12:37 PM, Flavio Pompermaier wrote: > We've finally created a running example (For Flink 0.10.2) of our improved > JDBC InputFormat that you can run from an IDE (it creates an in-memory > derby database with 1000 rows and ba

Re: FoldFunction accumulator checkpointing

2016-04-13 Thread Aljoscha Krettek
Hi, there are two cases where a FoldFunction can be used in the streaming API: KeyedStream.fold() and WindowedStream.fold()/apply(). In both cases we internally use the partitioned state abstraction of Flink to keep the state. So yes, the accumulator value is consistently maintained and will surviv
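The per-key accumulator behaviour described above can be sketched with a map standing in for Flink's partitioned state: each key folds its own accumulator independently, which is why the value survives per key. The class and method names are illustrative assumptions, not the FoldFunction API itself, and the initial value must be immutable (or copied) since it seeds every key.

```java
import java.util.*;
import java.util.function.*;

// Sketch of a keyed fold: one accumulator per key, updated by a folder
// function. The HashMap stands in for Flink's partitioned state backend,
// which is what makes the accumulator survive checkpoints per key.
class KeyedFoldSketch {
    public static <K, V, ACC> Map<K, ACC> fold(
            List<Map.Entry<K, V>> events,
            ACC initial,
            BiFunction<ACC, V, ACC> folder) {
        Map<K, ACC> state = new HashMap<>();
        for (Map.Entry<K, V> e : events) {
            ACC acc = state.getOrDefault(e.getKey(), initial);
            state.put(e.getKey(), folder.apply(acc, e.getValue()));
        }
        return state;
    }
}
```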

Re: RocksDB Statebackend

2016-04-13 Thread Aljoscha Krettek
Hi Maxim, yes the plan is to have a cache of hot values that uses the managed memory abstraction of Flink so that we can make sure that we stay within memory bounds and don't run into OOM exceptions. On Tue, 12 Apr 2016 at 23:37 Maxim wrote: > Is it possible to add an option to store the state i

Re: Does Kafka connector leverage Kafka message keys?

2016-04-13 Thread Andrew Coates
Hi Stephan, If we were to do that, would flink leverage the fact that Kafka has already partitioned the data by the key, or would flink attempt to shuffle the data again into its own partitions, potentially shuffling data between machines for no gain? Thanks, Andy On Sun, 10 Apr 2016, 13:22 Ste