Processing millions of messages in milliseconds real time -- Architecture guide required

2016-04-18 Thread Deepak Sharma
Hi all, I am looking for an architecture to ingest 10 mils of messages in the real time streaming mode. If anyone has worked on similar kind of architecture , can you please point me to any documentation around the same like what should be the architecture , which all components/big data ecosystem

Flink + S3

2016-04-18 Thread Michael-Keith Bernard
Hello Flink Users! I'm a Flink newbie at the early stages of deploying our first Flink cluster into production and I have a few questions about wiring up Flink with S3: * We are going to use the HA configuration[1] from day one (we have existing zk infrastructure already). Can S3 be used as a s

Does Flink support joining multiple streams based on event time window now?

2016-04-18 Thread Yifei Li
Hi, I am new to Flink and I've read some documentation and think Flink may fit my scenario. Here is my scenario: 1. Assume I have 3 streams: S1(id, name, email, action, date), S2(id, name, email, level, date), S3(id, name, position, date). *2. S2 always delays(hours to days, not determined..) *

Re: Gracefully stop long running streaming job

2016-04-18 Thread Robert Schmidtke
I'm on 0.10.2 which seems to be still lacking this feature. Anyway I'm happy to see it'll be in future releases, so I'll get to enjoy it once I upgrade :) I'm using a FlinkKafkaConsumer081 for the record. Anyway, thanks a bunch Robert On Tue, Apr 19, 2016 at 12:14 AM, Matthias J. Sax wrote: > I

Re: Gracefully stop long running streaming job

2016-04-18 Thread Matthias J. Sax
If all your sources implements Stoppable interface, you can STOP a job. ./bin/flink stop JobID STOP is however quite new and it is ongoing work to make available sources stoppable (some are already). Not sure what kind of sources you are using right now. -Matthias On 04/18/2016 10:50 PM, Rober

Gracefully stop long running streaming job

2016-04-18 Thread Robert Schmidtke
Hi everyone, I am running a streaming benchmark which involves a potentially infinitely running Flink Streaming Job. I run it blocking on YARN using ./bin/flink run ... and then send the command into background, remembering its PID to kill it later on. While this gets the work done, the job always

Hash tables - joins, cogroup, deltaIteration

2016-04-18 Thread Ovidiu-Cristian MARCU
Hi, Can you please confirm if there is any update regarding the hash tables use cases, as in [1] it is specified that Hash tables are used in Joins and for the Solution set in iterations (pending work to use them for grouping/aggregations)? I am interested in the pending work progress and also

class java.util.UUID is not a valid POJO type

2016-04-18 Thread Leonard Wolters
Hi, Quick question. I'm currently implementing our Machine Learning into Spark using the Scala interface. Do I understand correctly that when using scala only, you can only use primitive types? How can I register the UUID class in order to get supported for (de)serializing between the nodes? BTW

Re: Compression - AvroOutputFormat and over network ?

2016-04-18 Thread Robert Metzger
Hi Tarandeep, I think for that you would need to set a codec factory on the DataFileWriter. Sadly we don't expose that method to the user. If you want, you can contribute this change to Flink. Otherwise, I can quickly fix it. Regards, Robert On Mon, Apr 18, 2016 at 2:36 PM, Ufuk Celebi wrote:

Re: throttled stream

2016-04-18 Thread Robert Metzger
Hi, I would also go for Niels approach. If the mapper has the same parallelism as the source and its right after it, it'll be chained to the source. The throttling then happens with almost no overhead. Regarding the ThrottledIterator: Afaik there is no iterator involved when reading data out of th

Re: Flink 0.10.2 and Kafka 0.8.1

2016-04-18 Thread Robert Metzger
Yeah, that's the problem Your Kafka 0.8.1 dependency [1] doesn't depend on kafka-clients. With the explicit dependency definition, you are overriding the transitive Kafka dependency from the kafka connector. [1] https://repo1.maven.org/maven2/org/apache/kafka/kafka_2.10/0.8.1/kafka_2.10-0.8.1.pom

Re: Task Slots and Heterogeneous Tasks

2016-04-18 Thread Till Rohrmann
No, it's not possible at the moment to deploy more than one task of the same kind to a single slot. On Fri, Apr 15, 2016 at 8:08 PM, Maxim wrote: > I see. Sharing slots among subtasks makes sense. > So by default a subtask that executes a map function that calls a high > latency external servic

Re: Flink HDFS State Backend

2016-04-18 Thread Jason Brelloch
Yep, that was the problem. Thanks! On Mon, Apr 18, 2016 at 11:36 AM, Aljoscha Krettek wrote: > Hi, > could it be that your state is very small? In that case the state is not > actually stored in HDFS but on the job manager because writing it to HDFS > and storing a handle to that in the JobMana

Re: Flink 0.10.2 and Kafka 0.8.1

2016-04-18 Thread Robert Schmidtke
Turns out when I remove the explicit dependency on kafka_2.10 v. 0.8.1, then the dependencies are properly included. Guess there was a conflict somehow? I'll need to figure out if the rest of the code is fine with kafka_2.10 v. 0.8.2.0 as well. On Mon, Apr 18, 2016 at 4:32 PM, Robert Schmidtke wr

Re: Flink HDFS State Backend

2016-04-18 Thread Aljoscha Krettek
Hi, could it be that your state is very small? In that case the state is not actually stored in HDFS but on the job manager because writing it to HDFS and storing a handle to that in the JobManager would be more expensive. Cheers, Aljoscha On Mon, 18 Apr 2016 at 17:20 Jason Brelloch wrote: > Hi

Flink HDFS State Backend

2016-04-18 Thread Jason Brelloch
Hi everyone, I am trying to set up flink with a hdfs state backend. I configured state.backend and state.backend.fs.checkpointdir parameters in the flink-conf.yaml. I run the flink task and the checkpoint directories are created in hdfs, so it appears it can connect and talk to hdfs just fine. U

Flink + Kafka + Scalabuff issue

2016-04-18 Thread Alexander Gryzlov
Hello, Has anyone tried using ScalaBuff (https://github.com/SandroGrzicic/ScalaBuff) with Flink? We’re trying to consume Protobuf messages from Kafka 0.8 and have hit a performance issue. We run this code: https://gist.github.com/clayrat/05ac17523fcaa52fcc5165d9edb406b8 (where Foo is pre-gene

Re: Flink 0.10.2 and Kafka 0.8.1

2016-04-18 Thread Robert Schmidtke
Hi Robert, thanks for your offer. After playing around a bit I would like to take it, if you have the time: https://github.com/robert-schmidtke/HiBench/blob/flink-streaming/src/streambench/flinkbench/pom.xml I would guess the POM is similar to the one in the sample project, yet when building it,

Re: Flink 0.10.2 and Kafka 0.8.1

2016-04-18 Thread Robert Metzger
If you want you can share the pom of your project privately. On Mon, Apr 18, 2016 at 4:05 PM, Robert Schmidtke wrote: > You're right, it does not. When including it the resulting jar has the > Kafka dependencies bundled. Now it's up to me to figure out the difference > between the sample project

Re: Flink 0.10.2 and Kafka 0.8.1

2016-04-18 Thread Robert Schmidtke
You're right, it does not. When including it the resulting jar has the Kafka dependencies bundled. Now it's up to me to figure out the difference between the sample project and the one I'm working on. Thanks! Really quick help. Robert On Mon, Apr 18, 2016 at 4:02 PM, Robert Metzger wrote: > Hi

Re: Flink 0.10.2 and Kafka 0.8.1

2016-04-18 Thread Robert Metzger
Hi, the problem with the posted project is that it doesn't have the Flink kafka connector as a dependency. On Mon, Apr 18, 2016 at 3:56 PM, Robert Schmidtke wrote: > Hi Robert, > > thanks for your hints. I was not sure whether I was building a proper fat > jar, as I have not used the Flink Arche

Re: Flink 0.10.2 and Kafka 0.8.1

2016-04-18 Thread Robert Schmidtke
Hi Robert, thanks for your hints. I was not sure whether I was building a proper fat jar, as I have not used the Flink Archetype for my project. However, I have set up a sample project at https://github.com/robert-schmidtke/flink-test/ which is nothing more than the Quickstart Archetype plus the

Re: Configuring task slots and parallelism for single node Maven executed

2016-04-18 Thread Prez Cannady
Thank you both. Will let you guys know how it works out. Prez Cannady p: 617 500 3378 e: revp...@opencorrelate.org GH: https://github.com/opencorrelate LI: https://www.linkedin.com/in/revprez

Re: Compression - AvroOutputFormat and over network ?

2016-04-18 Thread Ufuk Celebi
Hey Tarandeep, regarding the network part: not possible at the moment. It's pretty straight forward to add support for it, but no one ever got around to actually implementing it. If you would like to contribute, I am happy to give some hints about which parts of the system would need to be modifie

Re: Flink ML 1.0.0 - Saving and Loading Models to Score a Single Feature Vector

2016-04-18 Thread KirstiLaurila
Answering to myself if someone is having similar problems. So already saved matrices can be read and used in als like this: // Setup the ALS learnerd val als = ALS() val users = env.readFile(new TypeSerializerInputFormat[Factors](createTypeInformation[Factors]),"path") val i

Compression - AvroOutputFormat and over network ?

2016-04-18 Thread Tarandeep Singh
Hi, How can I set compression for AvroOutputFormat when writing files on HDFS? Also, can we set compression for intermediate data that is sent over network (from map to reduce phase) ? Thanks, Tarandeep

Re: Flink 0.10.2 and Kafka 0.8.1

2016-04-18 Thread Robert Metzger
Hi, did you check your user jar if it contains the Kafka classes? Are you building a fat jar? Are you manually excluding any dependencies? Flink's 0.10.2 Kafka connector depends on Kafka 0.8.2.0 [1] which in turn depends on kafka-clients 0.8.2.0 [2]. And the "kafka-clients" dependency also contain

Re: providing java system arguments(-D) to specific job

2016-04-18 Thread Till Rohrmann
That is correct. You can provide it also as a property to the CLI: -Denv.java.opts="-Dmy-prop=bla -Dmyprop2=bla2" Cheers, Till On Sun, Apr 17, 2016 at 3:56 PM, Igor Berman wrote: > for the sake of history(at task manager level): > in conf/flink-conf.yaml > env.java.opts: -Dmy-prop=bla -Dmy-prop

Re: fan out parallel-able operator sub-task beyond total slots number

2016-04-18 Thread Till Rohrmann
Hi Chen, two subtasks of the same operator can never be executed within the same slot/pipeline. The `slotSharingGroup` allows you to only control which subtasks of different operators can be executed along side in the same slot. It basically allows you to break pipelines into smaller ones. Therefo

Re: withBroadcastSet for a DataStream missing?

2016-04-18 Thread Till Rohrmann
Hi Stavros, yes that’s how you could do it. broadcast will send the data to every down stream operator. An element will be processed whenever it arrives at the iteration head. There is no synchronization. A windowed stream cannot be the input for a connected stream. Thus, the window results hav

Re: Configuring task slots and parallelism for single node Maven executed

2016-04-18 Thread Till Rohrmann
Hi Prez, 1. the configuration setting taskmanager.numberOfTaskSlots says with how many task slots a TaskManager will be started. As a rough rule of thumb, set this value to the number of cores of the machine the TM is running on. This this link [1] for further information. The conf

Re: Testing Kafka interface using Flink interactive shell

2016-04-18 Thread Mich Talebzadeh
Thanks Chiwan. It worked. Now I have this simple streaming program in Spark Scala that gets streaming data via Kafka. It is pretty simple. Please see attached. I am trying to make it work with Flink + Kafka Any hints will be appreciated. Thanks Dr Mich Talebzadeh LinkedIn * https://www.l

Re: Accessing StateBackend snapshots outside of Flink

2016-04-18 Thread Aljoscha Krettek
Hi, key refers to the key extracted by your KeySelector. Right now, for every named state (i.e. the name in the StateDescriptor) there is a an isolated RocksDB instance. Cheers, Aljoscha On Sat, 16 Apr 2016 at 15:43 Igor Berman wrote: > thanks a lot for the info, seems not too complex > I'll tr