Handling data skewness

2024-08-16 Thread Karthick
Hi Team, I'm using keyBy to maintain field-based ordering across tasks, but I'm encountering data skewness among the task slots. I have 96 task slots, and I'm sending data with 500 distinct keys used in keyBy. While reviewing the Flink UI, I noticed that a few task slots are underutilized while ot
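
The message is truncated above, but 500 keys hashed onto 96 slots is a classic skew setup: a few heavy keys pin their subtasks while others idle. A common mitigation (not from the thread) is two-stage aggregation with a salted key. A minimal Java sketch, assuming the aggregation is decomposable (e.g. a sum) and that strict per-key ordering can be relaxed, which may conflict with the ordering requirement stated above; all class and field names are hypothetical:

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;

// Sketch: two-stage ("salted") aggregation to spread a hot key over several subtasks.
// Input is a hypothetical stream of (key, value) pairs.
public class SaltedSum {

    private static final int SALT_BUCKETS = 8; // sub-keys created per original key

    public static DataStream<Tuple2<String, Long>> saltedSum(DataStream<Tuple2<String, Long>> events) {
        return events
                // Stage 1: attach a random salt so records of one hot key land on different subtasks.
                .map(e -> Tuple3.of(e.f0, ThreadLocalRandom.current().nextInt(SALT_BUCKETS), e.f1))
                .returns(Types.TUPLE(Types.STRING, Types.INT, Types.LONG))
                .keyBy(t -> t.f0 + "#" + t.f1)   // partial sum per (key, salt)
                .sum(2)
                // Stage 2: drop the salt and combine the partial sums per original key.
                .map(t -> Tuple2.of(t.f0, t.f2))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .sum(1);
    }
}
```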

Re: Flink filesystem connector with regex support

2024-08-16 Thread amogh joshi
Hi Users, Any clues on a configurable regex path for the FileSource/Filesystem connector with the streaming API would be appreciated. Regards, Amogh. On Thu, 15 Aug, 2024, 11:18 amogh joshi, wrote: > Hi, > > I am building a pretty straightforward processing pipeline as described > below, using *DataStream* *AP
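
The thread is truncated, but one way to approximate regex path selection with the streaming FileSource is to plug in a filtering FileEnumerator. A minimal sketch, assuming the NonSplittingRecursiveEnumerator constructor that accepts a file-filter predicate is available in your flink-connector-files version; the bucket path and pattern are placeholders:

```java
import java.time.Duration;
import java.util.regex.Pattern;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.enumerate.NonSplittingRecursiveEnumerator;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: restrict a streaming FileSource to files whose paths match a regex,
// by plugging a filtering FileEnumerator into the builder.
public class RegexFileSourceJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical base directory and pattern.
        Path baseDir = new Path("s3://my-bucket/input/");
        Pattern pattern = Pattern.compile(".*events-\\d{8}\\.json");

        FileSource<String> source =
                FileSource.forRecordStreamFormat(new TextLineInputFormat(), baseDir)
                        // Enumerate recursively, keeping only paths that match the regex.
                        .setFileEnumerator(() -> new NonSplittingRecursiveEnumerator(
                                path -> pattern.matcher(path.getPath()).matches()))
                        // Keep watching the directory for new files.
                        .monitorContinuously(Duration.ofSeconds(30))
                        .build();

        DataStream<String> lines =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "regex-file-source");

        lines.print();
        env.execute("Regex file source sketch");
    }
}
```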

RE: Can we share states across tasks/operators

2024-08-16 Thread Schwalbe Matthias
Hi Christian, As I didn’t find a well-documented reference for MultipleInputStreamOperator, let me sketch it (it’s Scala code, easy enough to transform to Java): * It implements 4 inputs; the second is a broadcast stream * ACHEP, CEIEP, V1EP, AMAP are typedefs of the respective input typ
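
Matthias's Scala sketch of the MultipleInputStreamOperator is truncated above. For the narrower case of sharing one stream's data with every parallel task of another operator, the simpler two-input broadcast-state pattern below illustrates the idea; it is not the operator from the thread, and all types and names are hypothetical (Java):

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Sketch: sharing slowly-changing data with all parallel tasks via broadcast state.
// "rules" is a hypothetical configuration stream, "events" the main data stream.
public class BroadcastStateSketch {

    static final MapStateDescriptor<String, String> RULES_DESCRIPTOR =
            new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

    public static DataStream<String> apply(DataStream<String> events, DataStream<String> rules) {
        BroadcastStream<String> broadcastRules = rules.broadcast(RULES_DESCRIPTOR);

        return events
                .connect(broadcastRules)
                .process(new BroadcastProcessFunction<String, String, String>() {

                    @Override
                    public void processElement(String event, ReadOnlyContext ctx, Collector<String> out)
                            throws Exception {
                        // Every parallel task reads the same broadcast state (read-only here).
                        String rule = ctx.getBroadcastState(RULES_DESCRIPTOR).get("default");
                        out.collect(rule == null ? event : event + " [" + rule + "]");
                    }

                    @Override
                    public void processBroadcastElement(String rule, Context ctx, Collector<String> out)
                            throws Exception {
                        // The broadcast side updates the shared state on every task.
                        ctx.getBroadcastState(RULES_DESCRIPTOR).put("default", rule);
                    }
                });
    }
}
```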

Re: Integrating flink CDC with flink

2024-08-16 Thread Sachin Mittal
Ok. This makes sense. I will try it out. One more question: in terms of performance, which of the two connectors would scan the existing collection faster? Say the existing collection has 10 million records and is about 1 GB in storage size. Thanks Sachin On Fri, 16 Aug 2024 at 4:09 PM, Jiabao Sun

Re: Integrating flink CDC with flink

2024-08-16 Thread Jiabao Sun
Yes, you can use flink-connector-mongodb-cdc to process both existing and new data. See https://nightlies.apache.org/flink/flink-cdc-docs-release-3.1/docs/connectors/flink-sources/mongodb-cdc/#startup-reading-position Best, Jiabao On 2024/08/16 10:26:55 Sachin Mittal wrote: > Hi Jiabao, > My u
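
A minimal DataStream sketch of the startup position Jiabao points to, assuming the Flink CDC 3.x package layout from the linked docs (verify class and method names against your version); hosts, credentials, and collection names are placeholders, and StartupOptions.initial() snapshots the existing documents before streaming change events:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.cdc.connectors.base.options.StartupOptions;
import org.apache.flink.cdc.connectors.mongodb.source.MongoDBSource;
import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: MongoDB CDC source that first reads the existing collection, then streams changes.
public class MongoCdcJob {
    public static void main(String[] args) throws Exception {
        MongoDBSource<String> source = MongoDBSource.<String>builder()
                .hosts("mongo-host:27017")                 // hypothetical endpoint
                .databaseList("mydb")
                .collectionList("mydb.mycollection")
                .username("flink")
                .password("secret")
                // "initial" snapshots existing documents, then switches to the change stream.
                .startupOptions(StartupOptions.initial())
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "mongodb-cdc")
           .print();
        env.execute("MongoDB CDC sketch");
    }
}
```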

Re: Integrating flink CDC with flink

2024-08-16 Thread Sachin Mittal
Hi Jiabao, My use case is that when I start my Flink job it should load and process all the existing data in a collection and also wait for and process any new data that comes along the way. Since I notice that flink-connector-mongodb would process all the existing data, do I still need this connector o

Re: Integrating flink CDC with flink

2024-08-16 Thread Jiabao Sun
Hi Sachin, flink-connector-mongodb supports batch reading and writing to MongoDB, similar to flink-connector-jdbc, while flink-connector-mongodb-cdc supports streaming MongoDB changes. If you need to stream MongoDB changes, you should use flink-connector-mongodb-cdc. You can refer to the fol
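
For the bounded-read side Jiabao describes, a minimal flink-connector-mongodb sketch; the URI, database, and collection are placeholders, and the deserializer simply emits each document as JSON:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.mongodb.source.MongoSource;
import org.apache.flink.connector.mongodb.source.reader.deserializer.MongoDeserializationSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.bson.BsonDocument;

// Sketch: reading the existing documents of a MongoDB collection with flink-connector-mongodb.
public class MongoBatchReadJob {
    public static void main(String[] args) throws Exception {
        MongoSource<String> source = MongoSource.<String>builder()
                .setUri("mongodb://mongo-host:27017")   // hypothetical endpoint
                .setDatabase("mydb")
                .setCollection("mycollection")
                .setDeserializationSchema(new MongoDeserializationSchema<String>() {
                    @Override
                    public String deserialize(BsonDocument document) {
                        return document.toJson();       // emit each document as a JSON string
                    }

                    @Override
                    public TypeInformation<String> getProducedType() {
                        return TypeInformation.of(String.class);
                    }
                })
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "mongo-source")
           .print();
        env.execute("MongoDB bounded read sketch");
    }
}
```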

Integrating flink CDC with flink

2024-08-16 Thread Sachin Mittal
Hi, I have a scenario where I load a collection from MongoDB inside Flink using flink-connector-mongodb. What I additionally want is that any future changes (inserts/updates) to that collection are also streamed into my Flink job. What I was thinking of is to use a CDC connector to stream data to my Fl

OperatorStateFromBackend can't complete initialisation because of a high number of savepoint file reads

2024-08-16 Thread William Wallace
Context We have recently upgraded from Flink 1.13.6 to Flink 1.19. We consume data from ~40k Kafka topic partitions in some environments. We are using aligned checkpoints. We set state.storage.fs.memory-threshold: 500kb. Problem At the point when the state for the operator using topic-partition-off
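
The message is truncated, but since it already mentions state.storage.fs.memory-threshold, the sketch below only shows where that knob lives; raising it toward its 1 MB cap is a possible mitigation for the many small state files, not a confirmed fix for the restore slowness described, and it normally belongs in the cluster configuration rather than the job code:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: raising the inline-state threshold so small operator-state handles are embedded
// in the checkpoint metadata instead of being written (and later read back) as individual files.
// The option is capped at 1 MB; passing it programmatically as below only takes effect in
// deployments that honour the job-side configuration (otherwise set it in flink-conf.yaml).
public class MemoryThresholdSketch {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Default is 20kb; the thread already uses 500kb. 1mb is the maximum allowed value.
        config.setString("state.storage.fs.memory-threshold", "1mb");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);
        env.fromElements("placeholder").print();   // stand-in for the real pipeline
        env.execute("memory-threshold sketch");
    }
}
```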