Weird error in submitting a flink job to yarn cluster

2017-10-02 Thread vipul singh
Hello, I am working on a ParquetSink writer, which will convert a kafka stream to parquet format. I am having some weird issues deploying this application to a YARN cluster. I am not 100% sure this is a Flink-related error, but I wanted to reach out to folks here in case it might be. I

Re: kafka consumer parallelism

2017-10-02 Thread Timo Walther
Hi, I'm not a Kafka expert, but I think you need more than 1 Kafka partition to process multiple documents at the same time. Also make sure to send the documents to different partitions. Regards, Timo On 10/2/17 at 6:46 PM, r. r. wrote: Hello I'm running a job with "flink run -p5"
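
A minimal sketch of the setup Timo describes (topic name, broker address and group id are placeholders): the job asks for parallelism 5, but each Kafka partition is consumed by exactly one source subtask, so the topic itself needs at least 5 partitions before all subtasks receive records.

    import java.util.Properties;

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class ParallelConsumerJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(5); // 5 source subtasks...

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
            props.setProperty("group.id", "my-group");             // placeholder

            // ...but with a 1-partition topic only one subtask ever gets records;
            // the producer must also spread documents across the partitions.
            env.addSource(new FlinkKafkaConsumer010<>("docs", new SimpleStringSchema(), props))
               .print();

            env.execute("parallel kafka consumer");
        }
    }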

Re: Session Window set max timeout

2017-10-02 Thread Timo Walther
Hi, I would recommend implementing a custom trigger in this case. You can override the default trigger of your window: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/windows.html#triggers This is the interface where you can control the triggering: https://
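
A rough, untested sketch of such a trigger, assuming processing-time session windows (the class name is made up for illustration): it keeps the normal session-gap timer but also registers a hard-cap timer relative to the window start, and fires-and-purges on whichever comes first.

    import org.apache.flink.streaming.api.windowing.triggers.Trigger;
    import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

    public class SessionWithMaxDurationTrigger extends Trigger<Object, TimeWindow> {

        private final long maxDurationMs;

        public SessionWithMaxDurationTrigger(long maxDurationMs) {
            this.maxDurationMs = maxDurationMs;
        }

        @Override
        public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) {
            // normal session end: this timer moves as the window is extended/merged
            ctx.registerProcessingTimeTimer(window.maxTimestamp());
            // hard cap: for session windows, getStart() is the first element's time
            ctx.registerProcessingTimeTimer(window.getStart() + maxDurationMs);
            return TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
            return TriggerResult.FIRE_AND_PURGE; // either timer closes the session
        }

        @Override
        public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
            return TriggerResult.CONTINUE;
        }

        @Override
        public boolean canMerge() {
            return true;
        }

        @Override
        public void onMerge(TimeWindow window, OnMergeContext ctx) {
            ctx.registerProcessingTimeTimer(window.maxTimestamp());
        }

        @Override
        public void clear(TimeWindow window, TriggerContext ctx) {
            ctx.deleteProcessingTimeTimer(window.maxTimestamp());
            ctx.deleteProcessingTimeTimer(window.getStart() + maxDurationMs);
        }
    }

It would be attached with something like .window(ProcessingTimeSessionWindows.withGap(Time.minutes(10))).trigger(new SessionWithMaxDurationTrigger(3600000)).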

Re: Flink Watermark and timing

2017-10-02 Thread Timo Walther
Hi Björn, I don't know if I understand your example correctly, but I think your explanation "All events up to and equal to watermark should be handled in the previous window" is not 100% correct. Watermarks just indicate the progress ("until here we have seen all events with a lower timestamp than X"
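
A small illustration of that semantic, assuming a POJO MyEvent with a getTimestamp() accessor (both hypothetical): the watermark trails the highest timestamp seen so far, and a window fires only once the watermark passes the window's end, so an event whose timestamp equals the current watermark is not late.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.time.Time;

    DataStream<MyEvent> withTimestamps = events.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(MyEvent event) {
                return event.getTimestamp(); // epoch millis
            }
        });
    // With 5s of allowed out-of-orderness, a tumbling window [0s, 10s) fires
    // once the watermark passes ~10s, i.e. only after an event with a
    // timestamp of roughly 15s or more has been observed.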

Re: Avoid duplicate messages while restarting a job for an application upgrade

2017-10-02 Thread Piotr Nowojski
We are planning to work on this clean shutdown after releasing Flink 1.4. Implementing it properly would require some work, for example: - adding some checkpoint options to carry information about the "closing"/"shutting down" event - adding clean shutdown to the source functions API - implement handling o

kafka consumer parallelism

2017-10-02 Thread r. r.
Hello, I'm running a job with "flink run -p5" and additionally set env.setParallelism(5). The source of the stream is Kafka; the job uses FlinkKafkaConsumer010. In the Flink UI, though, I notice that if I send 3 documents to Kafka, only one 'instance' of the consumer seems to receive Kafka's record and

Re: Enriching data from external source with cache

2017-10-02 Thread Derek VerLee
Thanks Timo, watching the video now. I did try out the method with iteration in a simple prototype and it works. But you are right, combining it with the other requirements into a single process function has so far resulted in more complexity than I'd like, and it'

Re: Avoid duplicate messages while restarting a job for an application upgrade

2017-10-02 Thread Antoine Philippot
Thanks Piotr for your answer. Sadly, we can't use Kafka 0.11 for now (and won't be able to for a while). We cannot afford tens of thousands of duplicated messages for each application upgrade; can I help by working on this feature? Do you have any hint or details on this part of that "todo list"? On Mon., Oct 2

Re: Avoid duplicate messages while restarting a job for an application upgrade

2017-10-02 Thread Piotr Nowojski
Hi, for failure recovery with Kafka 0.9 it is not possible to avoid duplicated messages. Using Flink 1.4 (not yet released) combined with Kafka 0.11, it will be possible to achieve end-to-end exactly-once semantics when writing to Kafka. However, this is still a work in progress: https://issues.apach
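
At the time of writing that API was unreleased, so every name below is an assumption based on the linked work in progress rather than a confirmed interface; the idea is a producer that writes inside Kafka 0.11 transactions committed only when the corresponding checkpoint completes.

    import java.util.Properties;

    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
    import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "broker:9092"); // placeholder

    // EXACTLY_ONCE: transactional writes, committed on checkpoint completion
    stream.addSink(new FlinkKafkaProducer011<>(
        "output-topic",
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        props,
        FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));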

Session Window set max timeout

2017-10-02 Thread ant burton
Is it possible to limit session windows to a max of n seconds/hours etc.? I.e., I would like a session window, but if a session runs for an unacceptable amount of time, I would like to close it. Thanks,

At end of complex parallel flow, how to force end step with parallel=1?

2017-10-02 Thread Garrett Barton
I have a complex algorithm implemented using the DataSet API, and by default it runs with parallelism 90 for good performance. At the end I want to perform a clustering of the resulting data, and to do that correctly I need to pass all the data through a single thread/process. I read in the docs that as long
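
If the docs hold for this case, a per-operator parallelism override on the final step should achieve that; a sketch with placeholder names (HeavyAlgorithm, ClusterAll, MyOutputFormat are made up):

    DataSet<Result> results = input
        .map(new HeavyAlgorithm())
        .setParallelism(90);            // the expensive part stays wide

    results
        .reduceGroup(new ClusterAll())  // a non-grouped reduceGroup sees every record
        .setParallelism(1)              // single thread/process for the clustering
        .output(new MyOutputFormat());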

Savepoints - jobmanager.rpc.address

2017-10-02 Thread ant burton
Hi, when taking a savepoint on AWS EMR I get the following error: [hadoop@ip-10-12-169-172 ~]$ flink savepoint e14a6402b6f1e547c4adf40f43861c27 Retrieving JobManager. The program finished with the following exception: org.apache.flink
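
One thing worth trying (an assumption, not a confirmed fix for this error): point the CLI at the JobManager explicitly instead of letting it fall back to jobmanager.rpc.address from flink-conf.yaml:

    # pass the JobManager address explicitly
    flink savepoint -m <jobmanager-host>:<port> e14a6402b6f1e547c4adf40f43861c27

    # or, when the cluster runs as a YARN session (typical on EMR),
    # let the client resolve the JobManager from the YARN application
    flink savepoint -yid <yarn-application-id> e14a6402b6f1e547c4adf40f43861c27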

Avoid duplicate messages while restarting a job for an application upgrade

2017-10-02 Thread Antoine Philippot
Hi, I'm working on a Flink streaming app with a kafka09-to-kafka09 use case which handles around 100k messages per second. To upgrade our application, we used to run a flink cancel with savepoint command, followed by a flink run with the previously saved savepoint and the new application fat jar as
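
For reference, the upgrade procedure described here looks roughly like this on the CLI (paths, job ID and entry class are placeholders):

    # stop the job, writing a savepoint
    flink cancel -s hdfs:///savepoints <jobID>

    # redeploy the new fat jar from that savepoint
    flink run -s hdfs:///savepoints/savepoint-<...> -c my.app.Main new-app.jar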

RE: In-memory cache

2017-10-02 Thread Marchant, Hayden
Nice idea. Actually, we are looking at connect for other parts of our solution where latency is less critical. A few considerations for not using 'connect' in this case were: 1. To isolate the two streams from each other to reduce complexity, simplify debugging, etc. – since we are

Fwd: Consult about flink on mesos cluster

2017-10-02 Thread Bo Yu
Hello all, this is Bo. I ran into some problems when I tried to use Flink on my Mesos cluster (1 master, 2 slaves with 32-core CPUs). I tried to start mesos-appmaster.sh in Marathon; the JobManager started without problems. mesos-appmaster.sh -Djobmanager.heap.mb=1024 -Dtaskmanager.heap.mb=1024

Re: how many 'run -c' commands to start?

2017-10-02 Thread r. r.
Thanks, Chesnay, that was indeed the problem. It also explains why -p5 was not working for me from the cmdline. Best regards, Robert > Original message >From: Chesnay Schepler ches...@apache.org >Subject: Re: how many 'run -c' commands to start? >To: "r. r." >Sent

Re: In-memory cache

2017-10-02 Thread Stavros Kontopoulos
How about connecting two streams of data, one from the reference data and one from the main data (I assume using keyed streams, since you mention QueryableState), and keeping state locally within the operator? The idea is to have a local sub-copy of the reference data within the operator that is updated from
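
A bare-bones sketch of that pattern (Event, RefData and Enriched are hypothetical types): both streams are keyed the same way and connected, and the operator keeps the latest reference value per key in managed state.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
    import org.apache.flink.util.Collector;

    public class EnrichWithReference extends RichCoFlatMapFunction<Event, RefData, Enriched> {

        private transient ValueState<RefData> reference;

        @Override
        public void open(Configuration conf) {
            reference = getRuntimeContext().getState(
                new ValueStateDescriptor<>("reference", RefData.class));
        }

        // main stream: enrich with whatever reference data has arrived so far
        @Override
        public void flatMap1(Event event, Collector<Enriched> out) throws Exception {
            out.collect(new Enriched(event, reference.value()));
        }

        // reference stream: just refresh the local sub-copy
        @Override
        public void flatMap2(RefData update, Collector<Enriched> out) throws Exception {
            reference.update(update);
        }
    }

    // usage: main.keyBy(e -> e.key).connect(ref.keyBy(r -> r.key))
    //            .flatMap(new EnrichWithReference());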

Re: How flink monitor source stream task(Time Trigger) is running?

2017-10-02 Thread yunfan123
Thank you. "If SourceFunction.run methods returns without an exception Flink assumes that it has cleanly shutdown and that there were simply no more elements to collect/create by this task. " This sentence solve my confusion. -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.

Re: How to deal with blocking calls inside a Sink?

2017-10-02 Thread Federico D'Ambrosio
Hi Timo, thank you for your response. Just yesterday I tried the JDBC connector and unfortunately found out that the HivePreparedStatement and HiveStatement implementations still don't implement addBatch, which the connector relies on. The first dirty solution tha

Flink Watermark and timing

2017-10-02 Thread Björn Zachrisson
Hi, I have a question regarding the timing of events. According to https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html#event-time-and-watermarks all events up to and equal to the watermark should be handled in "the previous window". In my case I use the event timestamp. I'm t

In-memory cache

2017-10-02 Thread Marchant, Hayden
We have an operator in our streaming application that needs to access 'reference data' that is updated by another Flink streaming application. This reference data has about ~10,000 entries and a small footprint. It needs to be updated roughly every 100 ms. The required latency for

Re: How to deal with blocking calls inside a Sink?

2017-10-02 Thread Timo Walther
Hi Federico, would it help to buffer events first and perform batched insertions for better throughput? I saw some similar work recently here: https://tech.signavio.com/2017/postgres-flink-sink But I would first try the AsyncIO approach, because this is actually the use case it was made fo
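
A trimmed-down sketch of the AsyncIO route (MyRow and writeToHive are placeholders; the asyncInvoke signature shown is the Flink 1.4 one, Flink 1.3 uses AsyncCollector instead of ResultFuture): the blocking Hive call is handed off to a separate thread pool, so the task thread is never blocked.

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    public class AsyncHiveWriter extends RichAsyncFunction<MyRow, Boolean> {
        @Override
        public void asyncInvoke(MyRow row, ResultFuture<Boolean> resultFuture) {
            CompletableFuture
                .supplyAsync(() -> writeToHive(row)) // run the blocking call elsewhere
                .thenAccept(ok -> resultFuture.complete(Collections.singleton(ok)));
        }

        private boolean writeToHive(MyRow row) {
            // placeholder for the blocking Hive streaming interaction
            return true;
        }
    }

    // usage: up to 100 writes in flight, 30s timeout each
    DataStream<Boolean> acks = AsyncDataStream.unorderedWait(
        rows, new AsyncHiveWriter(), 30, TimeUnit.SECONDS, 100);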

Re: Enriching data from external source with cache

2017-10-02 Thread Timo Walther
Hi Derek, maybe the following talk can inspire you on how to do this with joins and async IO: https://www.youtube.com/watch?v=Do7C4UJyWCM (around the 17th minute). Basically, you split the stream and wait for an async IO result in a downstream operator. But I think having a transient guava cache
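
And a rough sketch of the transient Guava cache variant (Event, RefData, Enriched and lookupExternally are hypothetical): the cache lives only on the TaskManager, is rebuilt in open(), and is never checkpointed.

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    public class CachedEnricher extends RichMapFunction<Event, Enriched> {

        // transient: rebuilt in open(), not part of any snapshot
        private transient Cache<String, RefData> cache;

        @Override
        public void open(Configuration conf) {
            cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(10, TimeUnit.MINUTES)
                .build();
        }

        @Override
        public Enriched map(Event event) throws Exception {
            // load through the cache on miss
            RefData ref = cache.get(event.getKey(), () -> lookupExternally(event.getKey()));
            return new Enriched(event, ref);
        }

        private RefData lookupExternally(String key) {
            // placeholder for the real external lookup
            return new RefData();
        }
    }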

Re: ArrayIndexOutOfBoundExceptions while processing valve output watermark and while applying ReduceFunction in reducing state

2017-10-02 Thread Federico D'Ambrosio
As a follow-up: the Flink job currently has an uptime of almost 24 hours, with no failed checkpoints or restarts, whereas with async snapshots it would have already crashed 50 or so times. Regards, Federico 2017-09-30 19:01 GMT+02:00 Federico D'Ambrosio < federico.dambro...@smartlab.ws>: > Thank

How to deal with blocking calls inside a Sink?

2017-10-02 Thread Federico D'Ambrosio
Hi, I've implemented a sink for Hive as a RichSinkFunction, but once I integrated it into my current Flink job, I noticed that the processing of events slowed down badly, I guess because of some blocking calls that need to be made when interacting with the Hive streaming API. So, what can be done

Re: Windowing isn't applied per key

2017-10-02 Thread Timo Walther
Hi Marcus, at first glance your pipeline looks correct. It should not be executed with a parallelism of one if that is not specified explicitly. Which time semantics are you using? If it is event time, I would check your timestamp and watermark assignment. Maybe you can also check in the web f