arbitrary state handling in python api

2020-09-08 Thread Georg Heiler
Hi, does the python API expose some kind of mapGroupsWithState operator which can be applied on a window to handle arbitrary state? I would want to perform time-series anomaly detection using a streaming implementation of the matrix profile algorithm using https://github.com/TDAmeritrade/stumpy T

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Cristian
> The job sub directory will be cleaned up when the job > finished/canceled/failed. What does this mean? Also, to clarify: I'm a very sloppy developer. My jobs crash ALL the time... and yet, the jobs would ALWAYS resume from the last checkpoint. The only cases where I expect Flink to clean u

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Yang Wang
AFAIK, the HA data, including Zookeeper meta data and real data on DFS, will only be cleaned up when the Flink cluster reached terminated state. So if you are using a session cluster, the root cluster node on Zk will be cleaned up after you manually stop the session cluster. The job sub directory

Re: Difficulties with Minio state storage

2020-09-08 Thread Yangze Guo
Hi, Rex, I've tried to use MinIO as state backend and everything seems works well. Just sharing my configuration: ``` s3.access-key: s3.secret-key: s3.endpoint: http://localhost:9000 s3.path.style.access: true state.checkpoints.dir: s3://flink/checkpoints ``` I think the problem might be caused b

Difficulties with Minio state storage

2020-09-08 Thread Rex Fenley
Hello! I'm trying to test out Minio as state storage backend using docker-compose on my local machine but keep running into errors that seem strange to me. Any help would be much appreciated :) The problem: With the following environment: environment: - | FLINK_PROPERTIES= jobmanager.rpc.address

Re: State Storage Questions

2020-09-08 Thread Rex Fenley
Thanks a bunch! >For example, the Flink Kafka source operator's parallel instances maintain as operator state a mapping of partitions to offsets for the partitions that it is assigned to. This I think clarifies things. This is literally state for the operator to do its job, not really row data. T

How to get Latency Tracking results?

2020-09-08 Thread Pankaj Chand
Hello, How do I visualize (or extract) the results for Latency Tracking for a Flink local cluster? I set "metrics.latency.interval 100" in the conf/flink-conf.yaml file, and started the cluster and SocketWindowWordCount job. However, I could not find the latency distributions anywhere in the web U

Re: Flink 1.8.3 GC issues

2020-09-08 Thread Josson Paul
Hi Piotr, 2) SystemProcessingTimeService holds the HeapKeyedStateBackend and HeapKeyedStateBackend has lot of Objects and that is filling the Heap 3) I am not using Flink Kafka Connector. But we are using Apache Beam kafka connector. There is a change in the Apache Beam version. But the kafk

Re: Performance issue associated with managed RocksDB memory

2020-09-08 Thread Yun Tang
Hi Juha I planned to give some descriptions in Flink documentation to give such hints, however, it has too many details for RocksDB and we could increase the managed memory size to a proper value to avoid this in most cases. Since you have come across this and reported in user mailing list, and

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Cristian
I'm using the standalone script to start the cluster. As far as I can tell, it's not easy to reproduce. We found that zookeeper lost a node around the time this happened, but all of our other 75 Flink jobs which use the same setup, version and zookeeper, didn't have any issues. They didn't eve

Re: Should the StreamingFileSink mark the files "finished" when all bounded input sources are depleted?

2020-09-08 Thread Aljoscha Krettek
Hi, this is indeed the correct behaviour right now. Which doesn't mean that it's the behaviour that we would like to have. The reason why we can't move the "pending" files to "final" is that we don't have a point where we can do this in an idempotent and retryable fashion. When we do regular

Re:

2020-09-08 Thread Timo Walther
You are using the old connectors. The new connectors are available via SQL DDL (and execute_sql() API) like documented here: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/filesystem.html Maybe this will give your some performance boost, but certainly not eno

Re:

2020-09-08 Thread Violeta Milanović
Hi Timo, I actually tried many things, increasing jvm heap size and flink managed memory didn't help me. Running the same query without group by clause like this: select avg(transaction_amount) as avg_ta, avg(salary+bonus) as avg_income, avg(salary+bonus) - avg(transaction_amount) as

Re: Flink alert after database lookUp

2020-09-08 Thread Timo Walther
Hi Sunitha, what you are describing is a typical streaming enrichment. We need to enrich the stream with some data from a database. There are different strategies to handle this: 1) You are querying the database for every record. This is usually not what you want because it would slow down y

Re: FLINK DATASTREAM Processing Question

2020-09-08 Thread Timo Walther
Hi Vijay, one comment to add is that the performance might suffer with multiple map() calls. For safety reason, records between chained operators are serialized and deserialized in order to strictly don't influence each other. If all functions of a pipeline are guaranteed to not modify incomi

Re:

2020-09-08 Thread Timo Walther
Hi Violeta, can you share your connector code with us? The plan looks quite complicated given the relatively simple query. Maybe there is some optimization potential. But before we dive deeper, I see a `Map(to: Row)` which indicates that we might work with a legacy sink connector. Did you tr

Re: Performance issue associated with managed RocksDB memory

2020-09-08 Thread Juha Mynttinen
Hey Yun, Thanks for the detailed answer. It clarified how things work. Especially what is the role of RocksDB arena, and arena block size. I think there's no real-world case where it would make sense to start to a Flink job with RocksDB configured so that RocksDB flushes all the time, i.e. whe

Re: Checkpoint metadata deleted by Flink after ZK connection issues

2020-09-08 Thread Robert Metzger
Thanks a lot for reporting this problem here Cristian! I am not super familiar with the involved components, but the behavior you are describing doesn't sound right to me. Which entrypoint are you using? This is logged at the beginning, like this: "2020-09-08 14:45:32,807 INFO org.apache.flink.ru

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-09-08 Thread Dawid Wysakowicz
> The only one where I could see that users want different behaviour > BATCH jobs on the DataStream API. I agree that processing-time does > not make much sense in batch jobs. However, if users have written some > business logic using processing-time timers their jobs will silently > not work if we

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-09-08 Thread Aljoscha Krettek
I agree with almost all of your points! The only one where I could see that users want different behaviour BATCH jobs on the DataStream API. I agree that processing-time does not make much sense in batch jobs. However, if users have written some business logic using processing-time timers thei

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-09-08 Thread Dawid Wysakowicz
Hey Aljoscha A couple of thoughts for the two remaining TODOs in the doc: # Processing Time Support in BATCH/BOUNDED execution mode I think there are two somewhat orthogonal problems around this topic:     1. Firing processing timers at the end of the job     2. Having processing timers in the B