Re: Nondeterministic results with SQL job when parallelism is > 1

2021-04-16 Thread Jark Wu
Hi Dylan, The primary key ordering problem I mean above is about changelog. Batch queries only emit a final result, and thus don't have changelog, so it's safe to use batch mode. The problem only exists in streaming mode with more than 1 parallelism. Best, Jark On Fri, 16 Apr 2021 at 21:40, Dyl

Re: Multiple select queries in single job on flink table API

2021-04-16 Thread Yuval Itzchakov
Yes. Instead of calling execute on each table, create a StatementSet using your StreamTableEnvironment (tableEnv.createStatementSet) and use addInsert and finally .execute when you want to run the job. On Sat, Apr 17, 2021, 03:20 tbud wrote: > If I want to run two different select queries on

Multiple select queries in single job on flink table API

2021-04-16 Thread tbud
If I want to run two different select queries on a flink table created from the dataStream, the blink-planner runs them as two different jobs. Is there a way to combine them and run as a single job ? Example code : /StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironmen

Re: [External] : Re: Conflict in the document - About native Kubernetes per job mode

2021-04-16 Thread Fuyao Li
Hello Yang, Please take a look at the PR when you are free. https://github.com/apache/flink/pull/15602 Should be a simple change. Thanks! Best,

Flink Statefun Python Batch

2021-04-16 Thread Timothy Bess
Hi everyone, Is there a good way to access the batch of leads that Statefun sends to the Python SDK rather than processing events one by one? We're trying to run our data scientist's machine learning model through the SDK, but the code is very slow when we do single events and we don't get many of

Re: 2-phase commit and kafka

2021-04-16 Thread Vishal Santoshi
Thanks for the feedback and. glad I am on the right track. > Outstanding transactions should be automatically aborted on restart by Flink. Let me understand this 1. Flink pipe is cancelled and has dangling kafka transactions. 2. A new Flink pipe ( not restored from a checkpoint or sp ) is start

Re: Iterate Operator Checkpoint Failure

2021-04-16 Thread Lu Niu
Hi, Fabian Thanks for replying. I created this ticket. It contains how to reproduce it using code in flink-example package: https://issues.apache.org/jira/browse/FLINK-22326 Best Lu On Fri, Apr 16, 2021 at 1:25 AM Fabian Paul wrote: > Hi Lu, > > Can you provide some more detailed logs of what

Primary key preservation in table API select?

2021-04-16 Thread Brad Davis
I'm trying to write a relatively simple plan using the table API, and I'm getting horrific performance on my joins. I discovered after looking at the execution plan in the web UI that a number of the joins had NoUniqueKey on one or both sides of the join. I couldn't understand this as all of my t

Re: 2-phase commit and kafka

2021-04-16 Thread Arvid Heise
Hi Vishal, I think you pretty much nailed it. Outstanding transactions should be automatically aborted on restart by Flink. Flink (re)uses a pool of transaction ids, such that all possible transactions by Flink are canceled on restart. I guess the biggest downside of using a large transaction ti

proper way to manage watermarks with messages combining multiple timestamps

2021-04-16 Thread Mathieu D
Hello, I'm totally new to Flink, and I'd like to make sure I understand things properly around watermarks. We're processing messages from iot devices. Those messages have a timestamp, and we have a first phase of processing based on this timestamp. So far so good. These messages actually "pack"

2-phase commit and kafka

2021-04-16 Thread Vishal Santoshi
Hello folks So AFAIK data loss on exactly once will happen if - start a transaction on kafka. - pre commit done ( kafka is prepared for the commit ) - commit fails ( kafka went own or n/w issue or what ever ). kafka has an uncommitted transaction - pipe was down for

Re: Nondeterministic results with SQL job when parallelism is > 1

2021-04-16 Thread Dylan Forciea
Jark, Thanks for the heads up! I didn’t see this behavior when running in batch mode with parallelism turned on. Is it safe to do this kind of join in batch mode right now, or am I just getting lucky? Dylan From: Jark Wu Date: Friday, April 16, 2021 at 5:10 AM To: Dylan Forciea Cc: Timo Walt

Re: PyFlink UDF: When to use vectorized vs scalar

2021-04-16 Thread Fabian Paul
Hi Yik San, I think the usage of vectorized udfs highly depends on your input and output formats. For your example my first impression would say that parsing a JSON string is always an rather expensive operation and the vectorization has not much impact on that. I am ccing Dian Fu who is more

Flink Savepoint fault tolerance

2021-04-16 Thread dhanesh arole
Hello all, I had 2 questions regarding savepoint fault tolerance. Job manager restart: - Currently, we are triggering savepoints using REST apis. And query the status of savepoint by the returned handle. In case there is a network issue because of which we couldn't receive response then

Re: Question about state processor data outputs

2021-04-16 Thread Chen-Che Huang
Hi Robert, Due to some concerns, we planned to use state processor to achieve our goal. Now we will consider to reevaluate using datastream to do the job while exploring the possibility of implementing a custom FileOutputFormat. Thanks for your comments! Best wishes, Chen-Che Huang On 2021/0

Re: Nondeterministic results with SQL job when parallelism is > 1

2021-04-16 Thread Jark Wu
HI Dylan, I think this has the same reason as https://issues.apache.org/jira/browse/FLINK-20374. The root cause is that changelogs are shuffled by `attr` at second join, and thus records with the same `id` will be shuffled to different join tasks (also different sink tasks). So the data arrived at

PyFlink UDF: When to use vectorized vs scalar

2021-04-16 Thread Yik San Chan
The question is cross-posted on Stack Overflow https://stackoverflow.com/questions/67122265/pyflink-udf-when-to-use-vectorized-vs-scalar Is there a simple set of rules to follow when deciding between vectorized vs scalar PyFlink UDF? According to [docs]( https://ci.apache.org/projects/flink/flink

Re: java.io.StreamCorruptedException: unexpected block data

2021-04-16 Thread Alokh P
The flink version is 1.12.1 On Fri, Apr 16, 2021 at 1:59 PM Alokh P wrote: > Hi Community, > Facing this error when trying to query Parquet data using flink SQL Client > > Create Table command > > CREATE TABLE test( > `username` STRING, > `userid` INT) WITH ('connector' = 'filesystem', 'pat

java.io.StreamCorruptedException: unexpected block data

2021-04-16 Thread Alokh P
Hi Community, Facing this error when trying to query Parquet data using flink SQL Client Create Table command CREATE TABLE test( `username` STRING, `userid` INT) WITH ('connector' = 'filesystem', 'path' = '/home/centos/test/0016_part_00.parquet', 'format' = 'parquet' ); Select command : s

Re: Iterate Operator Checkpoint Failure

2021-04-16 Thread Fabian Paul
Hi Lu, Can you provide some more detailed logs of what happened during the checkpointing phase? If it is possible please enable debug logs enabled. It would be also great know whether you have implemented your own Iterator Operator or what kind of Flink program you are trying to execute. Best,

Re:Flink streaming job's taskmanager process killed by yarn nodemanager because of exceeding 'PHYSICAL' memory limit

2021-04-16 Thread 马阳阳
The Flink version we used is 1.12.0. | | 马阳阳 | | ma_yang_y...@163.com | 签名由网易邮箱大师定制 On 04/16/2021 16:07,马阳阳 wrote: Hi, community, When running a Flink streaming job with big state size, one task manager process was killed by the yarn node manager. The following log is from the yarn node manag

Flink streaming job's taskmanager process killed by yarn nodemanager because of exceeding 'PHYSICAL' memory limit

2021-04-16 Thread 马阳阳
Hi, community, When running a Flink streaming job with big state size, one task manager process was killed by the yarn node manager. The following log is from the yarn node manager: 2021-04-16 11:51:23,013 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonito

Re: Flink Hadoop config on docker-compose

2021-04-16 Thread Flavio Pompermaier
Hi Yang, isn't this something to fix? If I look at the documentation at [1], in the "Passing configuration via environment variables" section, there is: "The environment variable FLINK_PROPERTIES should contain a list of Flink cluster configuration options separated by new line, the same way as i