RE: Zeppelin

2019-04-26 Thread Smirnov Sergey Vladimirovich (39833)
Hi Dawid, great, thanks for answering. Jeff: Flink 1.8 with default settings, standalone cluster, one job manager node and three task manager nodes. Zeppelin 0.9 config: checked "Connect to existing cluster", host: 10.219.179.16, port: 6123. Created a simple notebook: %flink ExecutionEnvironment env = Execut

RE: kafka partitions, data locality

2019-04-26 Thread Smirnov Sergey Vladimirovich (39833)
Hi, Dawid, great, thanks! Any plans to make it stable? 1.9? Regards, Sergey From: Dawid Wysakowicz [mailto:dwysakow...@apache.org] Sent: Thursday, April 25, 2019 10:54 AM To: Smirnov Sergey Vladimirovich (39833) ; Ken Krugler Cc: user@flink.apache.org; d...@flink.apache.org Subject: Re: kafka

Re: How to implement custom stream operator over a window? And after the Count-Min Sketch?

2019-04-26 Thread Felipe Gutierrez
Hi Rong, thanks for your insights. I agree with the three points you made. My plan is to implement an operator that computes the Count-Min sketch and lets developers assign functions to improve the sketch's estimate (adding more/different functions makes the sketch more precise, hence
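For readers unfamiliar with the data structure being discussed, here is a minimal Count-Min sketch in plain Scala. This is an illustrative sketch only, not the poster's operator or any Flink code; the width/depth values and the per-row hashing scheme are assumptions chosen for brevity.

```scala
// Minimal Count-Min sketch: depth hash rows, width buckets per row.
// Point queries can only overestimate, never underestimate, true counts.
class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)
  // One simple (a, b) hash pair per row -- illustrative, not production hashing.
  private val seeds = (1 to depth).map(i => (31 * i + 17, 1000003 * i + 7))

  private def bucket(row: Int, item: String): Int = {
    val (a, b) = seeds(row)
    Math.floorMod(a * item.hashCode + b, width) // keep index in [0, width)
  }

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(row, item)) += count

  // Point query: the minimum over all rows upper-bounds the true count.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(row, item))).min
}
```

Adding more rows (depth) lowers the probability of overestimation; widening the rows lowers its magnitude, which is the precision trade-off the thread mentions.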

Re: kafka partitions, data locality

2019-04-26 Thread Stefan Richter
Hi Sergey, The reason this is flagged as beta is actually less about stability and more about the fact that it is supposed to be a "power user" feature, because bad things can happen if your data is not 100% correctly partitioned in the same way as Flink would partition it. This is w

Re: Emitting current state to a sink

2019-04-26 Thread Timo Walther
Hi Avi, did you have a look at the .connect() and .broadcast() API functionalities? They allow you to broadcast a control stream to all operators. Maybe this example [1] or other examples in this repository can help you. Regards, Timo [1] https://github.com/ververica/flink-training-exercis

Re: Exceptions when launching counts on a Flink DataSet concurrently

2019-04-26 Thread Timo Walther
Hi Juan, as far as I know we do not provide any concurrency guarantees for count() or collect(). Those methods need to be used with caution anyway, as the result size must not exceed a certain threshold. I will loop in Fabian, who might know more about the internals of the execution. Regards,

Serialising null value in case-class

2019-04-26 Thread Averell
Good day, I have a case-class defined like this: case class MyClass(ts: Long, s1: String, s2: String, i1: Integer, i2: Integer) object MyClass { val EMPTY = MyClass(0L, null, null, 0, 0) def apply(): MyClass = EMPTY } My code has been running fine (I was not aware of

Re: Identify orphan records after joining two streams

2019-04-26 Thread Averell
Hi Dawid, I just tried to change from CoProcessFunction with onTimer() to ProcessWindowFunction with Trigger and TumblingWindow. So I can key my stream by (id) instead of (id, eventTime). With this, I can use /reinterpretAsKeyedStream/, and hope that it would give better performance. I can also us

How to run a Flink 1.7.x session cluster on YARN in a default Java 7 environment

2019-04-26 Thread 胡逸才
At present, all our YARN clusters use a Java 7 environment. While trying to use Flink to deploy stream processing business scenarios, we found that Flink on YARN mode always failed to run a session. The YARN application log shows: Unsupported major.minor version 52.0
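Class file version 52.0 corresponds to Java 8, and Flink 1.7+ dropped Java 7 support, so the JobManager/TaskManager JVMs must run on Java 8. If Java 8 is installed on the YARN nodes but is not the cluster default, one commonly suggested workaround is to point Flink's YARN containers at it via `flink-conf.yaml`. The config keys below are taken from Flink's configuration reference and the path is an assumption; verify both against your Flink version.

```yaml
# flink-conf.yaml -- make Flink's YARN containers use a Java 8 install
# even when the cluster-wide default JAVA_HOME is Java 7.
# The path is illustrative; adjust to where Java 8 lives on your nodes.
containerized.master.env.JAVA_HOME: /usr/lib/jvm/java-8
containerized.taskmanager.env.JAVA_HOME: /usr/lib/jvm/java-8
```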

Re: Serialising null value in case-class

2019-04-26 Thread Timo Walther
Hi Averell, the reason for this lies in the internal serializer implementation. In general, the composite/wrapping type serializer is responsible for encoding nulls. The case class serializer does not support nulls, because Scala discourages the use of nulls and promotes `Option`. Some seriali

Re: Serialising null value in case-class

2019-04-26 Thread Averell
Thank you, Timo. In terms of performance, does the use of Option[] have a performance impact? I guess it does, because there will be one more layer of object handling, won't there? I am also confused about choosing between primitive types (Int, Long) vs object types (Integer, JLong). I have seen ma

Re: Serialising null value in case-class

2019-04-26 Thread Timo Walther
Currently, tuples and case classes are the most efficient data types because they avoid the need for special null handling. Everything else is hard to estimate. You might need to perform micro-benchmarks with the serializers you want to use if you have a very performance-critical use case. Obje
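To make the trade-off in this thread concrete, here is a plain-Scala illustration (no Flink involved) of the two modelling styles being compared. The type and field names are hypothetical; the point is that `Option` adds one wrapper object per value but makes absence explicit, while the null-based variant relies on manual checks.

```scala
// Null-able field vs. explicit absence via Option -- illustrative only.
case class WithNulls(ts: Long, s1: String)           // s1 may be null
case class WithOption(ts: Long, s1: Option[String])  // absence is explicit

val nullRecord = WithNulls(0L, null)
val noneRecord = WithOption(0L, None)

// Handling the "missing" case: manual null check vs. getOrElse.
def labelNull(r: WithNulls): String = if (r.s1 == null) "empty" else r.s1
def labelOpt(r: WithOption): String = r.s1.getOrElse("empty")
```

The extra `Some(...)` allocation per present value is the "one more layer of object handling" the thread asks about; whether it matters in practice is exactly what the suggested micro-benchmarks would show.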

Re: RichAsyncFunction Timer Service

2019-04-26 Thread Mikhail Pryakhin
Hi David, Thank you! Yes, fair enough, but take for instance the BucketingSink class [1]: it is a RichFunction which employs the TimerService to execute time-based logic that is not directly associated with the event flow, like, for example, closing files every n minutes, etc. In an AsyncFunction I int

Re: Exceptions when launching counts on a Flink DataSet concurrently

2019-04-26 Thread Juan Rodríguez Hortalá
Hi Timo, Thanks for your answer. I was surprised to have problems calling those methods concurrently, because I thought data sets were immutable. Now I understand that calling count or collect mutates the data set: not its contents, but some kind of execution plan included in the data set. I suggest add

Re: Flink CLI

2019-04-26 Thread Gary Yao
Hi Steve, (1) The CLI action you are looking for is called "modify" [1]. However, we want to temporarily disable this feature beginning from Flink 1.9 due to some caveats with it [2]. If you have objections, it would be appreciated if you could comment on the respective thread on

Re: Emitting current state to a sink

2019-04-26 Thread Avi Levi
Hi Timo, I definitely did, but when broadcasting a command and trying to access the persisted state (I mean the state of the data stream, not the broadcast one) you get the exception that I wrote (java.lang.NullPointerException: No key set. This method should not be called outside of a keyed contex

FileInputFormat that processes files in chronological order

2019-04-26 Thread Sergei Poganshev
Given a directory with input files of the following format: /data/shard1/file1.json /data/shard1/file2.json /data/shard1/file3.json /data/shard2/file1.json /data/shard2/file2.json /data/shard2/file3.json Is there a way to make FileInputFormat with parallelism 2 split processing by "shard" (folder
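One way to frame the question is as a grouping problem before the read: assign each file to a split keyed by its shard directory, with the files of a shard ordered chronologically. The sketch below does this in plain Scala (no Flink APIs); the assumption that file-name order equals chronological order is the questioner's, and the paths are the ones from the question.

```scala
// Group input paths by their shard directory, sorting each shard's files
// by name (assumed chronological). Each resulting group could then be
// handed to one parallel reader instance.
def splitsByShard(paths: Seq[String]): Map[String, Seq[String]] =
  paths
    .groupBy(p => p.substring(0, p.lastIndexOf('/')))    // shard directory
    .map { case (shard, files) => shard -> files.sorted } // chronological order
```

With parallelism 2 and two shard directories, each subtask would then consume exactly one shard's files in order, which is the behavior the question asks for.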

Working around lack of SQL triggers

2019-04-26 Thread deklanw
I'm not sure how to express my logic simply where early triggers are a necessity. My application has large windows (2 weeks~) where early triggering is absolutely required. But, also, my application has mostly relatively simple logic which can be expressed in SQL. There's a ton of duplication, lik