Please advise bootstrapping large state

2021-06-15 Thread Marco Villalobos
I must bootstrap state from postgres (approximately 200 GB of data) and I notice that the state processor API requires the DataSet API in order to bootstrap state for the Stream API. I wish there was a way to use the SQL API and use a partitioned scan, but I don't know if that is even possible wit

Flink SQL as DSL for flink CEP

2021-06-15 Thread Dipanjan Mazumder
Hi,     Can we say that Flink SQL is kind of a DSL overlay on flink CEP , i mean i need a DSL for flink CEP , so that i can decouple the CEP rules from code and pass them dynamically to be applied on different data streams. Flink CEP doen't have any DSL implementation , so is it that Flink SQL c

Re: How to gracefully handle job recovery failures

2021-06-15 Thread Li Peng
Understood, thanks all! -Li On Fri, Jun 11, 2021 at 12:40 AM Till Rohrmann wrote: > Hi Li, > > Roman is right about Flink's behavior and what you can do about it. The > idea behind its current behavior is the following: If Flink cannot recover > a job, it is very hard for it to tell whether it

Re: Re: Upgrade job topology in checkpoint

2021-06-15 Thread Padarn Wilson
We added a new sink to the job graph and redeployed - but the new sink did not receive any records, as though it were not connected to the graph (possible it was a code bug, but I was trying to understand if this make sense given the implementation) re-including mailing list, excluded by accident

Re: Re: Re: Re: Failed to cancel a job using the STOP rest API

2021-06-15 Thread Thomas Wang
Thanks everyone. I'm using Flink on EMR. I just updated to EMR 6.3 which uses Flink 1.12.1. I will report back whether this resolves the issue. Thomas On Wed, Jun 9, 2021 at 11:15 PM Yun Gao wrote: > Very thanks Kezhu for the catch, it also looks to me the same issue as > FLINK-21028. > > -

Re: Discard checkpoint files through a single recursive call

2021-06-15 Thread Jiahui Jiang
Hello Yun and Guowei, Thanks for the context! Looks like the plan is to have a Flink config flag to enable recursive deletion? Is there any plan to push through this PR in the next release? https://github.com/apache/flink/pull/9602 Thank you so much! Jiahui Fro

Re: is there a way to get the watermark per operator in tabular form?

2021-06-15 Thread JING ZHANG
Hi Jin, In Flink ui, there is already 'currentInputWatermark' [1] which represents the lowest watermark received by this task. Besides, Flink provided the following metrics about watermark. * currentInputWatermark [2], represents the lowest watermark received by this task * currentInputNWatermark

Re: after upgrade flink1.12 to flink1.13.1, flink web-ui's taskmanager detail page error

2021-06-15 Thread yidan zhao
does anyone has idea? Here I give another exception stack. Unhandled exception. org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Failed to serialize the result for RPC call : requestTaskManagerDetailsInfo. at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.serializeRemoteResultAndVe

Flink parameter configuration does not take effect

2021-06-15 Thread Jason Lee
Hi everyone, When I was researching and using Flink recently, I found that the official documentation on how to configure parameters is confusing, and when I set the parameters in some ways, it does not take effect. mainly as follows: we usually use a DDL Jar package to execute Flink SQL tas

Re: Discard checkpoint files through a single recursive call

2021-06-15 Thread Yun Tang
Hi Jiang, Please take a look at FLINK-17860 and FLINK-13856 for previous discussion of this problem. [1] https://issues.apache.org/jira/browse/FLINK-17860 [2] https://issues.apache.org/jira/browse/FLINK-13856 Best Yun Tang From: Guowei Ma Sent: Wednesday, June

Re: Resource Planning

2021-06-15 Thread Xintong Song
Hi Thomas, It would be helpful if you can provide the jobmanager/taskmanager logs, and gc logs if possible. Additionally, you may consider to monitor the cpu/memory related metrics [1], see if there's anything abnormal when the problem is observed. Thank you~ Xintong Song [1] https://ci.apach

Re: NPE when aggregate window.

2021-06-15 Thread Si-li Liu
The key used in the keyBy function. HaochengWang 于2021年6月12日周六 下午11:29写道: > Hi, > I meet the same exception, and find your suggestion here. I'm confused > about > the word 'grouping key', is that refers to the key of the accumulating hash > map, or the key that separate the stream by some inform

Re: Discard checkpoint files through a single recursive call

2021-06-15 Thread Guowei Ma
hi, Jiang I am afraid of misunderstanding what you mean, so can you elaborate on how you want to change it? For example, which interface or class do you want to add a method to? Although I am not a state expert, as far as I know, due to incremental checkpoints, when CompleteCheckpoint is discardin

Resource Planning

2021-06-15 Thread Thomas Wang
Hi, I'm trying to see if we have been given enough resources (i.e. CPU and memory) to each task node to perform a deduplication job. Currently, the job is not running very stable. What I have been observing is that after a couple of days run, we will suddenly see backpressure happen on one arbitra

is there a way to get the watermark per operator in tabular form?

2021-06-15 Thread Jin Yi
in the flink ui, is there a way to update the columns being shown to include the watermarks? in lieu of this, is it possible to query the watermarks throughout a flink job somehow? the rest api? thanks.

Re: S3 + Parquet credentials issue

2021-06-15 Thread Angelo G.
Thank you Svend and Till for your help. Sorry for the the late response. I'll try to give more information about the issue: I've not worked exactly in the situation you described, although I've had > to configure S3 access from a Flink application recently and here are a > couple of things I le

Save state on a CoGroupFunction and recover it after a failure

2021-06-15 Thread Felipe Gutierrez
Hi, I have a problem on my stream pipeline where the events on a CoGroupFunction are not restored after the application crashes. The application is like this: stream01.coGroup(stream02) .where(...).equalTo(...) .window(TumblingEventTimeWindows.of(1 minute)) .apply(new MyCoGroupFunction()) .proces

Re: RocksDB CPU resource usage

2021-06-15 Thread JING ZHANG
Hi Padarn, After switch stateBackend from filesystem to rocksdb, all reads/writes from/to backend have to go through de-/serialization to retrieve/store the state objects, this may cause more cpu cost. But I'm not sure it is the main reason leads to 3x CPU cost in your job. To find out the reason,

Checkpoint loading failure

2021-06-15 Thread Padarn Wilson
Hi all, We have a job that has a medium size state (around 4GB) and after adding a new part of the job graph (which should not impact the job too much) we found that every single checkpoint restore has the following error: Caused by: java.io.IOException: s3a://: Stream is closed! > at > org.apach

Re: PyFlink DataStream API problem

2021-06-15 Thread Dian Fu
It seems that there is something wrong during starting up the Python process. Have you installed Python 3 and also PyFlink in the docker image? Besides, you could take a look at the log of TaskManager and to see whether there are logs about the reason why the Python process starts up failed. R

Re: Upgrade job topology in checkpoint

2021-06-15 Thread Yun Gao
Hi Padarn, By default the checkpoint would be disposed when the job finished or failed, they would be retained only when explicitly required [1]. From the implementation perspective I think users could be able to change topology when restored from external checkpoint, but I think Flink would not

Re: TypeInfo issue with Avro SpecificRecord

2021-06-15 Thread Patrick Lucas
Alright, I figured it out—it's very similar to FLINK-13703, but instead of having to do with immutable fields, it's due to use of the Avro Gradle plugin option `gettersReturnOptional`. With this option, the generated code uses Optional for getters, but it's particularly useful with the option `opt

Re: 1.13.1 jobmanager annotations by pod template does not work

2021-06-15 Thread 陳昌倬
On Tue, Jun 15, 2021 at 04:40:00PM +0800, Yang Wang wrote: > Yes. It is the by-design behavior. Because the pod template is only > applicable to the "pod", not other resources(e.g. deployment, configmap). > > Currently, the JobManager pod is managed by deployment and the naked > TaskManager pods a

RocksDB CPU resource usage

2021-06-15 Thread Padarn Wilson
Hi all, We have a job that we just enabled rocksdb on (instead of file backend), and see that the CPU usage is almost 3x greater on (we had to increase taskmanagers 3x to get it to run. I don't really understand this, is there something we can look at to understand why CPU use is so high? Our sta

Re: Flink job restart when one ZK node is down

2021-06-15 Thread Yang Wang
It is a known issue. And please refer to FLINK-10052[1] for more information. [1]. https://issues.apache.org/jira/browse/FLINK-10052 Best, Yang yidan zhao 于2021年6月15日周二 下午3:43写道: > Yes it is expected, I have also met such problems. > > Lu Niu 于2021年6月15日周二 上午4:53写道: > > > > HI, Flink Users >

Re: 1.13.1 jobmanager annotations by pod template does not work

2021-06-15 Thread Yang Wang
Yes. It is the by-design behavior. Because the pod template is only applicable to the "pod", not other resources(e.g. deployment, configmap). Currently, the JobManager pod is managed by deployment and the naked TaskManager pods are managed by Flink ResourceManager. This is the root cause which mak

Re: 1.13.1 jobmanager annotations by pod template does not work

2021-06-15 Thread 陳昌倬
On Tue, Jun 15, 2021 at 04:22:07PM +0800, Yang Wang wrote: > The annotations, and labels in the pod template will only apply to the > JobManager pod, not the JobManager deployment. Thanks for the information. Is this behavior by design? In document, it looks like there is no different between job

Re: 1.13.1 jobmanager annotations by pod template does not work

2021-06-15 Thread Yang Wang
The annotations, and labels in the pod template will only apply to the JobManager pod, not the JobManager deployment. Best, Yang ChangZhuo Chen (陳昌倬) 于2021年6月11日周五 下午11:44写道: > On Fri, Jun 11, 2021 at 11:19:09PM +0800, Yang Wang wrote: > > Could you please share your pod template and the value

Re: Flink job restart when one ZK node is down

2021-06-15 Thread yidan zhao
Yes it is expected, I have also met such problems. Lu Niu 于2021年6月15日周二 上午4:53写道: > > HI, Flink Users > > We use a Zk cluster of 5 node for JM HA. When we terminate one node for > maintenance, we notice lots of flink job fully restarts. The error looks like: > ``` > org.apache.flink.util.FlinkEx

PyFlink DataStream API problem

2021-06-15 Thread László Ciople
Hello, I am experimenting with the Python DataStream API in Flink 1.13, in order to confirm that it is a viable fit for our needs, basically trying to prove that what can be done in the Java DataStream API also works in Python. During testing of a processing pipeline, I encountered a problem at the

Re: How to know (in code) how many times the job restarted?

2021-06-15 Thread Felipe Gutierrez
So, I was trying to improve by using the CheckpointedFunction as it shows here [1]. But the method isRestored() says in its documentation [2]: "Returns true, if state was restored from the snapshot of a previous execution. This returns always false for stateless tasks." It is weird because I am e