Re: what is the purpose or impact of the -yst (--yarnstreaming) argument

2016-05-25 Thread Aljoscha Krettek
Hi Prateek, this is a deprecated setting that affects how memory is allocated on Flink worker nodes. Since at least 1.0.0, the default behavior is the one that would previously be requested by the --yst flag. In short, you don't need the flag when running streaming programs. (Except Robert has

Re: Incremental updates

2016-05-25 Thread Aljoscha Krettek
Hi, first question: are you manually keying by "userId % numberOfPartitions"? Flink internally does roughly "key.hash() % numPartitions", so it is enough to specify the userId as your key. Now, for your questions: 1. What Flink guarantees is that the state for a key k is always available when an el
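The modulo point above can be illustrated outside Flink. The sketch below is plain Python, not Flink API; `assign_subtask` is a hypothetical stand-in for the internal "key.hash() % numPartitions" assignment described in the answer. It shows why pre-folding the key with `userId % numPartitions` merges state for distinct users:

```python
def assign_subtask(key, parallelism):
    # roughly what Flink does internally when routing a keyed element
    return hash(key) % parallelism

# Keying by userId: users 5 and 9 stay distinct keys with separate state,
# even though they may happen to run on the same subtask.
assert assign_subtask(5, 4) == assign_subtask(9, 4)

# Keying by "userId % parallelism" instead: 5 and 9 become the *same* key,
# so their per-key state would be merged — usually not what you want.
assert 5 % 4 == 9 % 4 == 1
```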

How to perform multiple stream join functionality

2016-05-25 Thread prateekarora
Hi, I am trying to port my Spark application to Flink. In Spark I used the command below to join multiple streams: val stream = stream1.join(stream2).join(stream3).join(stream4) As per my understanding, Flink requires a window operation because Flink doesn't work on RDDs like Spark does. So I tried

what is the purpose or impact of the -yst (--yarnstreaming) argument

2016-05-25 Thread prateekarora
Hi, I am running a Flink Kafka streaming application and have not seen any impact of the -yst (--yarnstreaming) argument on my application. I thought this argument was introduced in 1.0.2. Can anyone explain the purpose of this argument? Regards Prateek -- View this message in co

Re: Incremental updates

2016-05-25 Thread Malgorzata Kudelska
Hi, I have the following situation: - a keyed stream with a key defined as: userId % numberOfPartitions - a custom flatMap transformation where I use a ValueState variable to keep the state of some calculations for each userId - my questions are: 1. Does Flink guarantee that the users with a given

Re: stream keyBy without repartition

2016-05-25 Thread Bart Wyatt
Aljoscha, Thanks for the pointers. I was able to get a pretty simple utility class up and running that gives me basic keyed fold/reduce/windowedFold/windowedReduce operations that don't change the partitioning. This will be invaluable until an official feature is supported. Cheers, -Bart

Re: enableObjectReuse and multiple downstream operators

2016-05-25 Thread Aljoscha Krettek
Hi Bart, yup, this is a bug. AFAIK it is not yet known; would you like to open the Jira issue for it? If not, I can also open one. The problem is in the interaction of how chaining works in the streaming API with object reuse. As you said, with how it is implemented it serially calls the two map funct

text clustering

2016-05-25 Thread Souli Bilel
Hello all, Currently I have a project about 'news documents' from streaming sources: the goal is to cluster and analyse these documents. Unfortunately I am blocked on the clustering phase. Is there an algorithm or a method that I can use for text clustering, lik

Re: Non blocking operation in Apache flink

2016-05-25 Thread Maatary Okouya
Thank you, I will study that. It is a bit more raw, I would say. The thing is, my source is Kafka. I will have to see how to combine all of that together in the most elegant way possible. Will get back to you on this, after I scratch my head enough. Best, Daniel On Wed, May 25, 2016 at 11:02 AM

enableObjectReuse and multiple downstream operators

2016-05-25 Thread Bart Wyatt
(For reference, I'm on 1.0.3.) I have a job that looks like this: DataStream input = ... input .map(MapFunction...) .addSink(...); input .map(MapFunction...) .addSink(...); If I do not call enableObjectReuse() it works; if I do call enableObjectReuse() it throws: java

Re: Non blocking operation in Apache flink

2016-05-25 Thread Aljoscha Krettek
I see what you mean now. The Akka Streams API is very interesting, in how they allow async calls. For Flink, I think you could implement it as a custom source that listens for the change stream, starts futures to get data from the database and emits elements when the future completes. I quickly sk
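The pattern Aljoscha describes — start an asynchronous lookup per incoming change event and emit only when the future completes — can be sketched independently of Flink. This is a minimal Python sketch under assumed names (`fetch_from_db` is a hypothetical stand-in for the real database call, and the list `emitted` stands in for the source emitting elements):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_from_db(record_id):
    # stand-in for the real (slow) database lookup
    return {"id": record_id, "value": record_id * 10}

emitted = []
with ThreadPoolExecutor(max_workers=4) as pool:
    # one future per change event received from the change stream
    futures = [pool.submit(fetch_from_db, rid) for rid in [1, 2, 3]]
    for fut in futures:
        # emit an element only once its future has completed
        emitted.append(fut.result())
```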

Re: Logging Exceptions

2016-05-25 Thread David Kim
Awesome, thank you! David On Wed, May 25, 2016 at 4:54 AM Aljoscha Krettek wrote: > Hi David, > you are right, for some exceptions Flink only forwards to the > web-dashboard/application client but does not print to the log file. I > opened a Jira issue to track this: FLINK-3969 >

Re: stream keyBy without repartition

2016-05-25 Thread Aljoscha Krettek
In the long run we probably have to provide a hook in the API for this, yes. On Wed, 25 May 2016 at 15:54 Bart Wyatt wrote: > I will give this a shot this morning. > > > Considering this and the other email "Does Kafka connector leverage Kafka > message keys?" which also ends up talking about h

Re: Logging with slf4j

2016-05-25 Thread Stefano Baghino
It looks like the logs should show up. Are you using log4j or logback? Did you include the binding to slf4j for the library you're using? Did any warning from slf4j appear on stdout when the job was run? If you set everything up correctly, can you share your logback/log4j config file with us so that we can m

Logging with slf4j

2016-05-25 Thread simon peyer
Hi guys, Trying to log stuff, I used the print/println functions, which work quite well. But now I would like to use the slf4j logger. For each class/object in Scala I initialized a logger like this: var log: Logger = LoggerFactory.getLogger(getClass) then I did some logging: log.error("Error")

Re: stream keyBy without repartition

2016-05-25 Thread Bart Wyatt
I will give this a shot this morning. Considering this and the other email "Does Kafka connector leverage Kafka message keys?", which also ends up talking about hacking around KeyedStream's use of a HashPartitioner<>(...), is it worth looking into providing a KeyedStream constructor that uses

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread omaralvarez
Thanks to everybody, all my doubts are solved. I gotta give it to you guys, the answers were really fast! Cheers, Omar. -- View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Data-ingestion-using-a-Flink-TCP-Server-tp7134p7169.html Sent from the A

Re: How to perform this join operation?

2016-05-25 Thread Stephan Ewen
Hi Elias! I think you brought up a couple of good issues. Let me try and summarize what we have so far: 1) Joining in a more flexible fashion => The problem you are solving with the trailing / sliding window combination: Is the right way to phrase the join problem "join records where key is e

Re: Merging sets with common elements

2016-05-25 Thread Simone Robutti
@Till: A more meaningful example would be the following: from {(1,{1,2}), (2,{2,3}), (3,{4,5}), (4,{1,27})} the result should be {1,2,3,27},{4,5} because the sets #1, #2 and #4 have at least one element in common. If you see this as a graph where the elements of the sets are nodes and a set expresses a fu
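The merge Simone describes is a connected-components problem: treat elements as graph nodes and let each set connect its elements. A minimal union-find sketch (plain Python, not the Flink/Gelly implementation; `merge_sets` is a hypothetical helper name) reproduces the example:

```python
def merge_sets(sets):
    """Merge input sets transitively whenever two sets share an element."""
    parent = {}

    def find(x):
        while parent.get(x, x) != x:
            parent[x] = parent.get(parent[x], parent[x])  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # connect every element of a set to that set's first element
    for s in sets:
        elems = list(s)
        for elem in elems[1:]:
            union(elems[0], elem)

    # group elements by the root of their component
    groups = {}
    for s in sets:
        for elem in s:
            groups.setdefault(find(elem), set()).add(elem)
    return sorted(groups.values(), key=min)

result = merge_sets([{1, 2}, {2, 3}, {4, 5}, {1, 27}])
# {1,2} and {2,3} share 2; {1,27} shares 1 — they merge; {4,5} stays alone
```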

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread Aljoscha Krettek
Hi, yes, if you have 4 independent non-parallel sources they will be executed independently in different threads. Cheers, Aljoscha On Wed, 25 May 2016 at 13:40 omaralvarez wrote: > Thanks for your answers, this makes the implementation way easier, I don't > have to worry about queues. > > I wil

Re: Incremental updates

2016-05-25 Thread Aljoscha Krettek
Hi, right now this does not work, but we're also actively working on that. This is the design doc for part one of the necessary changes: https://docs.google.com/document/d/1G1OS1z3xEBOrYD4wSu-LuBCyPUWyFd9l3T9WyssQ63w/edit?usp=sharing Cheers, Aljoscha On Wed, 25 May 2016 at 13:32 Malgorzata Kud

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread omaralvarez
Thanks for your answers, this makes the implementation way easier, I don't have to worry about queues. I will take a look at the kafka connector. So my only remaining question is how serial stream sources are handled. If I have four independent streams, will the sources be handled by different

Re: Incremental updates

2016-05-25 Thread Malgorzata Kudelska
Hi, Thanks for your reply. Is Flink able to detect that an additional server joined and rebalance the processing? How is it done if I have a keyed stream and some custom ValueState variables? Cheers, Gosia 2016-05-25 11:32 GMT+02:00 Aljoscha Krettek : > Hi Gosia, > right now, Flink is not doing

Re: Does Kafka connector leverage Kafka message keys?

2016-05-25 Thread Stephan Ewen
Hi! @Krzysztof: If you use a very simple program like "read kafka" => "enrich (map / flatmap)" => "write kafka", then there will be no shuffle in Flink as well. It will be a very lightweight program, reusing the Kafka Partitioning. @Elias: KeyBy() assumes that the partitioning can be altered by F

Re: Combining streams with static data and using REST API as a sink

2016-05-25 Thread Josh
Hi Aljoscha, That sounds exactly like the kind of feature I was looking for, since my use-case fits the "Join stream with slowly evolving data" example. For now, I will do an implementation similar to Max's suggestion. Of course it's not as nice as the proposed feature, as there will be a delay in

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread Stephan Ewen
Hi! A typical example of a parallel source is the Kafka Source. Actually, other threads than the main run() thread can call ctx.collect(), provided they use the checkpoint lock properly. The Kafka source does that. Stephan On Wed, May 25, 2016 at 11:50 AM, omaralvarez wrote: > Hi, > > Thank

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread omaralvarez
Hi, Thank you very much for your answer. There is one more doubt in my mind: how are non-parallelized source functions processed? For instance, let's say I have four streams that implement SourceFunction; will they be placed on different parallel instances or will they be processed sequentially by t

Re: Non blocking operation in Apache flink

2016-05-25 Thread Maatary Okouya
Maybe the following can illustrate better what i mean http://doc.akka.io/docs/akka/2.4.6/scala/stream/stream-integrations.html#Integrating_with_External_Services On Wed, May 25, 2016 at 5:16 AM, Aljoscha Krettek wrote: > Hi, > there is no functionality to have asynchronous calls in user function

Re: Logging Exceptions

2016-05-25 Thread Aljoscha Krettek
Hi David, you are right, for some exceptions Flink only forwards to the web-dashboard/application client but does not print to the log file. I opened a Jira issue to track this: FLINK-3969. Thanks for reporting! Aljoscha On Mon, 23 May 2016 at

Re: Non blocking operation in Apache flink

2016-05-25 Thread Maatary Okouya
Thank you for your answer. Maybe I should have mentioned that I am at the beginning with both frameworks, somewhat making a choice by evaluating their capabilities. I know Akka Streams better. So my question would be simple. Let's say that 1) I have a stream of events that are simply information about t

Re: Merging sets with common elements

2016-05-25 Thread Aljoscha Krettek
Hi, if I understood it correctly the "key" in that case would be a fuzzy/probabilistic key. I'm not sure this can be computed using either the sort-based or hash-based joining/grouping strategies of Flink. Maybe we can find something if you elaborate. Cheers, Aljoscha On Wed, 25 May 2016 at 10:2

Re: Merging sets with common elements

2016-05-25 Thread Till Rohrmann
Hi Simone, could you elaborate a little bit on the actual operation you want to perform. Given a data set {(1, {1,2}), (2, {2,3})} what's the result of your operation? Is the result { ({1,2}, {1,2,3}) } because the 2 is contained in both sets? Cheers, Till On Wed, May 25, 2016 at 10:22 AM, Simon

Re: Dynamic partitioning for stream output

2016-05-25 Thread Kostas Kloudas
Hi Juho, To be more aligned with the semantics in Flink, I would suggest a solution with a single modified RollingSink that caches multiple buckets (from the Bucketer) and flushes (some of) them to disk whenever certain time or space criteria are met. I would say that it is worth modifying th
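The single-sink idea Kostas sketches — one writer caching several open buckets and flushing a bucket whenever a time or space criterion is met — can be illustrated stand-alone. This is a plain Python sketch; `BucketingWriter` is a hypothetical name, not the actual RollingSink API, and only the space criterion is shown:

```python
class BucketingWriter:
    """Caches multiple open buckets; flushes a bucket when it grows too big."""

    def __init__(self, max_bucket_size):
        self.max_bucket_size = max_bucket_size
        self.buckets = {}   # bucket id -> records cached in memory
        self.flushed = []   # (bucket id, records) written "to disk"

    def write(self, bucket_id, record):
        bucket = self.buckets.setdefault(bucket_id, [])
        bucket.append(record)
        if len(bucket) >= self.max_bucket_size:  # space criterion;
            self.flush(bucket_id)                # a timer could trigger the same

    def flush(self, bucket_id):
        self.flushed.append((bucket_id, self.buckets.pop(bucket_id)))

w = BucketingWriter(max_bucket_size=2)
for bucket, rec in [("a", 1), ("b", 2), ("a", 3), ("a", 4)]:
    w.write(bucket, rec)
# bucket "a" reached size 2 and was flushed; "b" and the new "a" stay cached
```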

Re: Incremental updates

2016-05-25 Thread Aljoscha Krettek
Hi Gosia, right now, Flink is not doing incremental checkpoints. Every checkpoint is fully valid in isolation. Incremental checkpointing came up several times in ML discussions and we are planning to work on it once someone finds some free time. Cheers, Aljoscha On Wed, 25 May 2016 at 09:29 Rubén C

Re: subtasks and objects

2016-05-25 Thread Aljoscha Krettek
Hi, I think this is correct, yes. It is probably not a good idea to use static code in Flink jobs. Cheers, Aljoscha On Wed, 25 May 2016 at 00:27 Stavros Kontopoulos wrote: > Hey, > > Is it true that since taskmanager (a jvm) may have multiple subtasks > implementing the same operator and thus s
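Why static code is risky here can be shown in miniature: class-level ("static") state is shared by every operator instance in the same process, so multiple subtasks in one taskmanager JVM read and write the same variable (and would additionally race without synchronization). A plain Python sketch, with instance state standing in for per-subtask state:

```python
class Operator:
    static_counter = 0              # shared by all instances in this process

    def __init__(self):
        self.instance_counter = 0   # private to each subtask instance

    def process(self, n):
        for _ in range(n):
            Operator.static_counter += 1   # every instance mutates this;
            self.instance_counter += 1     # this one stays isolated

a, b = Operator(), Operator()   # two "subtasks" in the same JVM-like process
a.process(3)
b.process(3)
# each instance counted 3 for itself, but the shared static saw all 6
```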

Re: Flink's WordCount at scale of 1BLN of unique words

2016-05-25 Thread Aljoscha Krettek
Hi, first, regarding your use-case questions: 1. if you do a keyBy(..) on the "word" then the same words will end up on the same machine. 2. This depends on the StateBackend that you use. For example, there is the FileStateBackend that keeps state in memory and does checkpoints to a file system an

Re: Data ingestion using a Flink TCP Server

2016-05-25 Thread Aljoscha Krettek
Hi, regarding your first question. I think it is in general not safe to call ctx.collect() from Threads other than the Thread that is invoking the run() method of your SourceFunction. What I would suggest is to have a queue that your reader threads put data into and then read from that queue in the
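The queue-based pattern suggested above can be sketched in a few lines: reader threads never emit directly; they put records on a queue, and the single run()-style loop drains the queue and emits, so emission happens on one thread only. Plain Python sketch; the list `emitted` is a hypothetical stand-in for the ctx.collect() calls made from run():

```python
import queue
import threading

q = queue.Queue()
SENTINEL = object()  # marks end of input for this sketch

def reader(records):
    # one of several reader threads: only enqueues, never emits
    for r in records:
        q.put(r)

readers = [threading.Thread(target=reader, args=(rng,))
           for rng in ([1, 2], [3, 4])]
for t in readers:
    t.start()
for t in readers:
    t.join()
q.put(SENTINEL)

emitted = []  # stands in for ctx.collect() in the single run() loop
while True:
    item = q.get()
    if item is SENTINEL:
        break
    emitted.append(item)
```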

Re: Non blocking operation in Apache flink

2016-05-25 Thread Aljoscha Krettek
Hi, there is no functionality to have asynchronous calls in user functions in Flink. The asynchronous action feature in Spark is also not meant for such things, it is targeted at programs that need to pull all data to the application master. In Flink this is not necessary because you can specify a

Re: stream keyBy without repartition

2016-05-25 Thread Aljoscha Krettek
Hi, what Kostas said is correct. You can however, hack it. You would have to manually instantiate a WindowOperator and apply it on the non-keyed DataStream while still providing a key-selector (and serializer) for state. This might sound complicated but I'll try and walk you through the steps. Ple

Re: Flink config as an argument (akka.client.timeout)

2016-05-25 Thread Ufuk Celebi
On Wed, May 25, 2016 at 8:49 AM, Juho Autio wrote: > Is there any way to set akka.client.timeout (or other flink config) when > calling bin/flink run instead of editing flink-conf.yaml? I tried to add it > as a -yD flag but couldn't get it working. I think this currently only works for YARN with

Re: writeAsCSV with partitionBy

2016-05-25 Thread Aljoscha Krettek
Hi, the RollingSink can only be used with streaming. Adding support for dynamic paths based on element contents is certainly interesting. I imagine it can be tricky, though, to figure out when to close/flush the buckets. Cheers, Aljoscha On Wed, 25 May 2016 at 08:36 KirstiLaurila wrote: > Maybe

Merging sets with common elements

2016-05-25 Thread Simone Robutti
Hello, I'm implementing MinHash for recommendation on Flink. I'm almost done, but I need an efficient way to merge sets of similar keys together (and later join these sets of keys with more data). The actual data structure is of the form DataSet[(Int, Set[Int])] where the left element of the tuple

Re: Incremental updates

2016-05-25 Thread Rubén Casado
Hi Gosia, You can have a look at the PROTEUS project we are working on [1]. We are implementing incremental versions of analytics operations. For example, you can see in [2] the implementation of the incremental AVG. Maybe the code can give you some ideas :-) [1] https://github.com/proteus-h2020/pr