question regarding flink local buffer pool

2021-01-05 Thread Eleanore Jin
Hi experts, I am running Flink 1.10 and the flink job is stateless. I am trying to understand how the local buffer pool works: 1. let's say taskA and taskB both run in the same TM JVM, each task will have its own local buffer pool, and taskA will write to pool-A, and taskB will read from pool-A and write

Task scheduling of Flink

2021-01-05 Thread penguin.
Hello! Do you know how to modify the task scheduling method of Flink?

Using key.fields in 1.12

2021-01-05 Thread Aeden Jameson
I've upgraded from 1.11.1 to 1.12 in hopes of using the key.fields feature of the Kafka SQL Connector. My current connector is configured as: connector.type = 'kafka' connector.version = 'universal' connector.topic = 'my-topic' connector.properties.group.id = 'my-consumer-group' connector.pro
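A minimal sketch (not taken from the thread) of the 1.12-style table declaration that supports key.fields; it replaces the legacy connector.type options above. The table name, columns, broker address, and formats are illustrative assumptions; only the topic and group id are reused from the post:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());
    tEnv.executeSql(
            "CREATE TABLE my_table (" +
            "  user_id STRING," +
            "  item_id STRING," +
            "  amount DOUBLE" +
            ") WITH (" +
            "  'connector' = 'kafka'," +                    // new option style, no connector.type
            "  'topic' = 'my-topic'," +
            "  'properties.bootstrap.servers' = 'broker:9092'," +
            "  'properties.group.id' = 'my-consumer-group'," +
            "  'key.format' = 'json'," +
            "  'key.fields' = 'user_id'," +                 // column(s) populated from the record key
            "  'value.format' = 'json'" +
            ")");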

Re: "flink list" fails when zk-based HA is enabled in flink-conf.yaml

2021-01-05 Thread Yang Wang
Hi Dongwon, I think the root cause is that GenericCLI does not override "high-availability.cluster-id" with the specified application id. The GenericCLI is activated by "--target yarn-per-job". In the FlinkYarnSessionCli, we have done this. And the following command should work with/without ZooKeepe

Re: Job manager high availability job restarting

2021-01-05 Thread Yang Wang
I think it is the expected behavior. When the active JobManager loses leadership, the standby one will try to take over and recover the job from the latest successful checkpoint. The high availability just helps with leader election/retrieval and HA meta storage (e.g. job graphs, checkpoints, etc.)

Re: Batch with Flink Streaming API version 1.12.0

2021-01-05 Thread Robert Cullen
Arvid, I’m hoping to get your input on a process I’m working on. Originally I was using a streaming solution but noticed that the data in the sliding windows was getting too large over longer intervals and sometimes stopped processing altogether. Anyway, the total counts should be a fixed number s

Job manager high availability job restarting

2021-01-05 Thread Giselle van Dongen
Hi! We are running a highly available Flink cluster in standalone mode with ZooKeeper, with 2 JobManagers and 5 TaskManagers. When the JobManager is killed, the standby JobManager takes over. But the job is also restarted. Is this the default behavior and can we avoid job restarts (for jobmanage

Re: UDTAGG and SQL

2021-01-05 Thread Marco Villalobos
Hi Timo, Can you please elaborate a bit on what you mean? I am not sure that I completely understand. Thank you. Sincerely, Marco A. Villalobos On Tue, Jan 5, 2021 at 6:58 AM Timo Walther wrote: > A subquery could work but since you want to implement a UDTAGG anyway, > you can also move the

Re: Submitting a job in non-blocking mode using curl and the REST API

2021-01-05 Thread Chesnay Schepler
I'm not aware of any plans to change this behavior. Overall the community is rather split on what role the web-submission should play; whether it should be a genuine way to submit jobs with the same capabilities as the CLI, just for prototyping, or not exist at all. As it stands we are somewhat

Re: Chaining 2 flink jobs through a KAFKA topic with checkpoint enabled

2021-01-05 Thread Daniel Peled
Thank you for your answers. We ended up changing the isolation level to read_uncommitted and it solved the latency problem between the two jobs, with the understanding that we may get duplicates in the second job when the first job fails and rolls back. On Tue, Jan 5, 2021 at 15:23, 赵一旦 wrote: > I t
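A minimal sketch of where that setting usually lives, assuming the second job reads the intermediate topic with the DataStream FlinkKafkaConsumer; the topic name, broker address, and group id are made up for illustration:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "broker:9092");   // placeholder
    props.setProperty("group.id", "downstream-job");         // placeholder
    // read_committed only exposes records once the upstream transaction (i.e. the
    // upstream checkpoint) commits; read_uncommitted trades duplicates on failure
    // for lower latency, as discussed above.
    props.setProperty("isolation.level", "read_uncommitted");

    FlinkKafkaConsumer<String> consumer =
            new FlinkKafkaConsumer<>("intermediate-topic", new SimpleStringSchema(), props);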

Re: Batch with Flink Streaming API version 1.12.0

2021-01-05 Thread Arvid Heise
Sorry Robert for not checking the complete example. New sources are used with fromSource instead of addSource. It's not ideal but hopefully we can remove the old way rather soonish to avoid confusion. On Tue, Jan 5, 2021 at 5:23 PM Robert Cullen wrote: > Arvid, > > Thank you. Sorry, my last post

Re: Batch with Flink Streaming API version 1.12.0

2021-01-05 Thread Robert Cullen
Arvid, Thank you. Sorry, my last post was not clear. This line: env.addSource(source) does not compile since addSource is expecting a SourceFunction not a KafkaSource type. On Tue, Jan 5, 2021 at 11:16 AM Arvid Heise wrote: > Robert, > > here I modified your example with some highlights. > >

RE: Submitting a job in non-blocking mode using curl and the REST API

2021-01-05 Thread Adam Roberts
Thanks Chesnay for the prompt response - ah, so my cunning plan to use execution.attached=true doesn't sound so reasonable now, then (I was going to look at providing that as a programArg next). I did find, in https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.ht

Re: Batch with Flink Streaming API version 1.12.0

2021-01-05 Thread Arvid Heise
Robert, here I modified your example with some highlights. final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); env.setRuntimeMode(RuntimeExecutionMode.STREAMING); KafkaSource source = KafkaSource.builder().setBootstrapServers("kafka-headl

Re: Submitting a job in non-blocking mode using curl and the REST API

2021-01-05 Thread Chesnay Schepler
All jobs going through the web-submission are run in detached mode for technical reasons (blocking of threads, and information having to be transported back to the JobManager for things like collect()). You unfortunately cannot run non-detached/attached/blocking jobs via the web submission, wh

Re: Batch with Flink Streaming API version 1.12.0

2021-01-05 Thread Robert Cullen
Arvid, Thanks, can you show me an example of how the source is tied to the ExecutionEnvironment? final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); env.setRuntimeMode(RuntimeExecutionMode.STREAMING); KafkaSource source = KafkaSource.builder()

"flink list" fails when zk-based HA is enabled in flink-conf.yaml

2021-01-05 Thread Dongwon Kim
Hi, I'm using Flink-1.12.0 and running on Hadoop YARN. After setting HA-related properties in flink-conf.yaml, high-availability: zookeeper high-availability.zookeeper.path.root: /recovery high-availability.zookeeper.quorum: nm01:2181,nm02:2181,nm03:2181 high-availability.storageDir: hdfs:///

Submitting a job in non-blocking mode using curl and the REST API

2021-01-05 Thread Adam Roberts
Hey everyone, I've got an awesome looking Flink cluster set up with web.submit.enable=true, and plenty of bash for handling jar upload and then submission to a JobManager - all good so far. Unfortunately, when I try to submit the classic WordCount example, I get a massive error with the gist of i
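For reference, a minimal sketch of the submission step against the REST API (not taken from the thread), written with Java's built-in HttpClient rather than curl; the host, port, jar id, and entry class are illustrative assumptions, and the request body fields follow the /jars/:jarid/run endpoint as I understand it:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SubmitJar {
        public static void main(String[] args) throws Exception {
            // hypothetical id returned earlier by POST /jars/upload
            String jarId = "abc123_WordCount.jar";
            String body = "{\"entryClass\":\"org.apache.flink.examples.java.wordcount.WordCount\",\"parallelism\":1}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://jobmanager:8081/jars/" + jarId + "/run"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // on success the response contains the job id; the job itself runs detached
            System.out.println(response.body());
        }
    }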

Re: UDTAGG and SQL

2021-01-05 Thread Timo Walther
A subquery could work but since you want to implement a UDTAGG anyway, you can also move the implementation there and keep the SQL query simple. But this is up to you. Consecutive windows are supported. Regards, Timo On 05.01.21 15:23, Marco Villalobos wrote: Hi Timo, Thank you for the quic

Re: [DISCUSS] FLIP-147: Support Checkpoints After Tasks Finished

2021-01-05 Thread Yun Gao
Hi Arvid, Many thanks for the feedback! For the second issue, sorry, I think I did not make it very clear. I was initially thinking of the case where, for example, for a job with graph A -> B -> C, when we compute which tasks to trigger, A is still running, so we trigger A to

Re: UDTAGG and SQL

2021-01-05 Thread Marco Villalobos
Hi Timo, Thank you for the quick response. Neither COLLECT nor LISTAGG works because they only accept one column. I am trying to collect all the rows and columns into one object, like a List for example. Later, I need to make calculations on all the rows that were just collected within a window.

Re: UDTAGG and SQL

2021-01-05 Thread Timo Walther
Hi Marco, nesting aggregate functions is not allowed in SQL. The exception could be improved though. I guess the planner searches for a scalar function called `MyUDTAGG` in your example and cannot find one. Maybe the built-in function `COLLECT` or `LISTAGG` is what you are looking for? htt

UDTAGG and SQL

2021-01-05 Thread Marco Villalobos
I am trying to use a user-defined table aggregate function directly in SQL so that I can combine all the rows collected in a window into one object. GIVEN a user-defined table aggregate function public class MyUDTAGG extends TableAggregateFunction { public PurchaseWindow createAcc
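For context, a minimal sketch of the general shape of such a function (not Marco's actual class): it collects all rows of a group into a list and emits that list once, which in 1.12 is used from the Table API via flatAggregate rather than directly in SQL. The class name and the use of Row are illustrative, and result/accumulator type hints are omitted:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flink.table.functions.TableAggregateFunction;
    import org.apache.flink.types.Row;
    import org.apache.flink.util.Collector;

    public class CollectRows extends TableAggregateFunction<List<Row>, List<Row>> {

        @Override
        public List<Row> createAccumulator() {
            return new ArrayList<>();
        }

        // called once per input row; accumulate() is resolved by convention
        public void accumulate(List<Row> acc, Row row) {
            acc.add(row);
        }

        // emits the collected rows when the group (e.g. the window) is finalized
        public void emitValue(List<Row> acc, Collector<List<Row>> out) {
            out.collect(new ArrayList<>(acc));
        }
    }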

Re: Chaining 2 flink jobs through a KAFKA topic with checkpoint enabled

2021-01-05 Thread 赵一旦
I think what you need is http://kafka.apache.org/documentation/#consumerconfigs_isolation.level . The isolation.level setting's default value is read_uncommitted. So, maybe you do not use the default setting? 赵一旦 wrote on Tue, Jan 5, 2021 at 9:10 PM: > I do not have this problem, so I guess it is related with

Re: Chaining 2 flink jobs through a KAFKA topic with checkpoint enabled

2021-01-05 Thread 赵一旦
I do not have this problem, so I guess it is related to the config of your Kafka producer and consumer, and maybe the Kafka topic properties or Kafka server properties as well. Arvid Heise wrote on Tue, Jan 5, 2021 at 6:47 PM: > Hi Daniel, > > Flink commits transactions on checkpoints while Kafka Streams/connect >

Re: Flink SQL, temporal joins and backfilling data

2021-01-05 Thread Timo Walther
Hi Dan, are you sure that your watermarks are still correct during reprocessing? As far as I know, idle state retention is not used for temporal joins. The watermark indicates when state can be removed in this case. Maybe you can give us some more details about which kind of temporal join yo

Re: Comparing Flink vs Materialize

2021-01-05 Thread Arvid Heise
Hi Dan, I have not touched Materialize yet, but that picture is too simplistic. When you run Flink in parallel, each source shard is assigned to one Flink source operator. Similarly, filter and map would run in parallel. Flink chains simple operators that have the same degree of paral

Re: Kafka SQL Connector Behavior (v1.11)

2021-01-05 Thread Arvid Heise
Hi Aeden, I just checked the code and your assumption is correct. Without an explicit partitioner, Flink just writes ProducerRecord without partition (null), so that whatever Kafka usually does applies. On Tue, Jan 5, 2021 at 1:53 AM Aeden Jameson wrote: > Based on these docs, > > https://ci.ap
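The same behavior can be reproduced on the DataStream side; a minimal sketch (not from the thread) where no custom partitioner is supplied, so the ProducerRecord carries no partition and Kafka's own partitioner decides. The topic and broker address are placeholders:

    import java.util.Optional;
    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
    import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "broker:9092");  // placeholder

    // Optional.empty() means Flink sets no partition on the ProducerRecord,
    // so partitioning falls back to whatever Kafka does by default.
    FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
            "my-topic",
            new SimpleStringSchema(),
            props,
            Optional.<FlinkKafkaPartitioner<String>>empty());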

Re: Batch with Flink Streaming API version 1.12.0

2021-01-05 Thread Arvid Heise
Hi Robert, you basically just (re)write your application with DataStream API, use the new KafkaSource, and then let the automatic batch detection mode work [1]. The most important part is that all your sources need to be bounded. Assuming that you just have a Kafka source, you need to use setBound
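Putting Arvid's points together, a minimal sketch of a bounded Kafka pipeline (my own assembly, not code from the thread); the broker, topic, and group id are placeholders, and the deserializer helper follows the 1.12 connector API as I recall it, so the exact method names may need adjusting:

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.kafka.common.serialization.StringDeserializer;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setRuntimeMode(RuntimeExecutionMode.BATCH);  // or AUTOMATIC, since the source below is bounded

    KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("broker:9092")               // placeholder
            .setTopics("input-topic")                         // placeholder
            .setGroupId("batch-run")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setBounded(OffsetsInitializer.latest())          // makes the source bounded
            .setDeserializer(KafkaRecordDeserializer.valueOnly(StringDeserializer.class))
            .build();

    // new sources are attached with fromSource, not addSource
    env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source").print();
    env.execute();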

Re: Flink SQL - IntervalJoin doesn't support consuming update and delete - trying to deduplicate rows

2021-01-05 Thread Arvid Heise
Hi Dan, Which Flink version are you using? I know that there has been quite a bit of optimization of deduplication in 1.12, which would reduce the required state tremendously. I'm pulling in Jark who knows more. On Thu, Dec 31, 2020 at 6:54 AM Dan Hill wrote: > Hi! > > I'm using Flink SQL to do
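For reference, the deduplication pattern the planner recognizes, as a minimal sketch assuming a table events with a key column id and a processing-time attribute proctime (names are illustrative, and tEnv is an existing table environment); keeping the first row per key yields an append-only result, which a downstream interval join can consume:

    tEnv.executeSql(
        "SELECT id, payload " +
        "FROM ( " +
        "  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY proctime ASC) AS row_num " +
        "  FROM events " +
        ") " +
        "WHERE row_num = 1");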

Re: Flink Logging on EMR

2021-01-05 Thread Arvid Heise
Hi KristoffSC, taskmanager.out should only show the output of the process starting the taskmanager. In most cases, you probably want to look into taskmanager.log. On Tue, Dec 29, 2020 at 3:42 PM KristoffSC wrote: > Hi Mars, > Were you able to solve this problem? > > I'm facing exact same issue.

Re: Implementing a TarInputFormat based on FileInputFormat

2021-01-05 Thread Arvid Heise
Hi Billy, the exception is happening on the output side. Input side looks fine. Could you maybe post more information about the sink? On Mon, Dec 28, 2020 at 8:11 PM Billy Bain wrote: > I am trying to implement a class that will work similar to AvroFileFormat. > > This tar archive has a very sp

Re: Getting Exception in thread "main" java.util.concurrent.ExecutionException: scala.tools.reflect.ToolBoxError: reflective compilation has failed: cannot initialize the compiler

2021-01-05 Thread Arvid Heise
Hi Avi, I'm not a Scala guy, but I'm guessing that you are mixing Scala versions. Could you check that your user code uses the same Scala version as Flink (1.11 or 1.12)? I have also heard of issues with different minor versions of Scala, so make sure to use the exact same version (e.g. 2.11.12)

Re: read a tarred + gzipped file flink 1.12

2021-01-05 Thread Arvid Heise
Hi Billy, I suspect that it's not possible in Flink as is. The tar file acts as a directory containing an arbitrary number of files. Afaik, Flink assumes that all compressed files are just single files, like gz without tar. It's like this in your case, but then the tar part doesn't make much sense.

[CVE-2020-17518] Apache Flink directory traversal attack: remote file writing through the REST API

2021-01-05 Thread Robert Metzger
CVE-2020-17518: Apache Flink directory traversal attack: remote file writing through the REST API Vendor: The Apache Software Foundation Versions Affected: 1.5.1 to 1.11.2 Description: Flink 1.5.1 introduced a REST handler that allows you to write an uploaded file to an arbitrary location on the

[CVE-2020-17519] Apache Flink directory traversal attack: reading remote files through the REST API

2021-01-05 Thread Robert Metzger
CVE-2020-17519: Apache Flink directory traversal attack: reading remote files through the REST API Vendor: The Apache Software Foundation Versions Affected: 1.11.0, 1.11.1, 1.11.2 Description: A change introduced in Apache Flink 1.11.0 (and released in 1.11.1 and 1.11.2 as well) allows attackers

Re: Re: checkpoint delay consume message

2021-01-05 Thread Arvid Heise
There seems to be a double-post with the mail "Long latency when consuming a message from KAFKA and checkpoint is enabled". Let's continue the discussion there. On Sun, Dec 27, 2020 at 8:33 AM nick toker wrote: > Hi, We think we are using the default values unless we are missing > something.

Re: Long latency when consuming a message from KAFKA and checkpoint is enabled

2021-01-05 Thread Arvid Heise
Hi Nick, I'm not entirely sure that I understand your setup correctly. Basically, when enabling exactly once and checkpointing, Flink will only consume messages that have been committed. If you chain two Flink jobs with an intermediate Kafka topic, then the first Flink job will only commit messag

Re: Chaining 2 flink jobs through a KAFKA topic with checkpoint enabled

2021-01-05 Thread Arvid Heise
Hi Daniel, Flink commits transactions on checkpoints while Kafka Streams/Connect usually commits per record. This is the typical tradeoff between throughput and latency. By decreasing the checkpoint interval in Flink, you can reach comparable latency to Kafka Streams. If you have two exactly once
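A minimal sketch of the knob Arvid refers to (the interval value is illustrative); with an exactly-once Kafka sink and a read_committed consumer in the second job, the checkpoint interval is effectively a lower bound on the end-to-end latency between the two jobs:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // commit Kafka transactions roughly every 5 seconds
    env.enableCheckpointing(5_000L, CheckpointingMode.EXACTLY_ONCE);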

Re: Throwing Recoverable Exceptions from Tasks

2021-01-05 Thread Arvid Heise
A typical solution to your issue is to use an ELK stack to collect the logs and define some filters on log events. If it's specific to input data issues, I also found side-outputs useful to store invalid data points. Then, you can simply monitor the side topic (assuming Kafka) and already have the
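A minimal sketch of the side-output idea (not from the thread); input, isValid(), and the tag name are stand-ins for your own stream and validation logic:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    // anonymous subclass so the element type is preserved
    final OutputTag<String> invalidTag = new OutputTag<String>("invalid-records") {};

    SingleOutputStreamOperator<String> valid =
            input.process(new ProcessFunction<String, String>() {
                @Override
                public void processElement(String value, Context ctx, Collector<String> out) {
                    if (isValid(value)) {                 // your own validation check
                        out.collect(value);
                    } else {
                        ctx.output(invalidTag, value);    // route bad records to the side output
                    }
                }
            });

    // e.g. write this stream to a dead-letter Kafka topic and monitor it there
    DataStream<String> invalid = valid.getSideOutput(invalidTag);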

Re: Some questions about limit push down

2021-01-05 Thread Arvid Heise
This is most likely a bug, could you reiterate a bit how it is invalid? I'm also CCing Jark since he is one of the SQL experts. On Mon, Dec 28, 2020 at 10:37 AM Jun Zhang wrote: > when I query hive table by sql, like this `select * from hivetable where > id = 1 limit 1`, I found that the limit

Re: Flink 1.12.0 docker image MISSING

2021-01-05 Thread Chesnay Schepler
This is a known issue: FLINK-20632 On 1/5/2021 11:27 AM, Alexandru Vasiu wrote: Hi, Docker image for flink 1.12 is missing from Docker Hub. Thank you, Alex Vasiu

Re: Realtime Data processing from HBase

2021-01-05 Thread Arvid Heise
Hi Sunitha, The current HBase connector only works continuously with Table API/SQL. If you use the input format, it only reads the data once as you have found out. What you can do is to implement your own source that repeatedly polls data and uses pagination or filters to poll only new data. You
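A rough skeleton of such a polling source, as a sketch only; fetchNewRows() and the poll interval are hypothetical placeholders for the actual HBase client call that applies pagination or a filter for new rows:

    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

    public class PollingHBaseSource extends RichSourceFunction<String> {

        private volatile boolean running = true;

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            while (running) {
                for (String row : fetchNewRows()) {          // hypothetical helper
                    synchronized (ctx.getCheckpointLock()) { // keep emission consistent with checkpoints
                        ctx.collect(row);
                    }
                }
                Thread.sleep(10_000L);                       // poll interval, illustrative
            }
        }

        @Override
        public void cancel() {
            running = false;
        }

        private Iterable<String> fetchNewRows() {
            // placeholder: query HBase for rows newer than the last emitted one
            return java.util.Collections.emptyList();
        }
    }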

Flink 1.12.0 docker image MISSING

2021-01-05 Thread Alexandru Vasiu
Hi, Docker image for flink 1.12 is missing from Docker Hub. Thank you, Alex Vasiu

Re: [DISCUSS] FLIP-147: Support Checkpoints After Tasks Finished

2021-01-05 Thread Arvid Heise
Hi Yun, 1. I'd think that this is an orthogonal issue, which I'd solve separately. My gut feeling says that this is something we should only address for new sinks where we decouple the semantics of commits and checkpoints anyways. @Aljoscha Krettek any idea on this one? 2. I'm not sure I get it

Re: Flink + DDL reads from Kafka with no output, but the Kafka consumer side has data

2021-01-05 Thread Arvid Heise
Note that you posted to the English-speaking mailing list. For the Chinese-speaking version please use user...@flink.apache.org. On Thu, Dec 24, 2020 at 3:39 PM Appleyuchi wrote: > This is Flink 1.12; the Kafka consumer can read the data, but the code below cannot read any data. After running there is no error and no output. Please help, thanks. > > import org.apache.flink.streaming.api.scala._

Re: using DefaultScalaModule

2021-01-05 Thread Arvid Heise
Hi Debasish, The idea of shading is actually to hide Flink's dependencies from the user, so that they can use their own dependencies with appropriate versions. That means you add Jackson together with jackson-module-scala to your application jar without worrying about Flink's Jackson at all (just tr