Re: Re: Optimize exact deduplication for tens of billions of records per day

2024-04-01 Thread Jeyhun Karimov
Hi Lei, In addition to the valuable options suggested above, you could try to optimize your partitioning function (since you know your data). If possible, sample a subset of your data and/or check the key distribution before redefining your partitioning function. Regards, Jeyhun On Mo
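A minimal sketch of the idea above, not code from the thread: after inspecting a sample of the key distribution offline, route records with a custom Partitioner so that skewed keys do not overload a single subtask. `Event`, `getKey()`, and `HOT_KEYS` are illustrative placeholders; `DataStream.partitionCustom` is the actual Flink API used here.

```java
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.streaming.api.datastream.DataStream;

import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

public class SkewAwarePartitioning {

    // Keys found to dominate the distribution in an offline sample (placeholder values).
    static final Set<String> HOT_KEYS = Set.of("keyA", "keyB");

    public static DataStream<Event> repartition(DataStream<Event> events) {
        return events.partitionCustom(
                (Partitioner<String>) (key, numPartitions) -> {
                    if (HOT_KEYS.contains(key)) {
                        // Salt hot keys across a few partitions. Note: for exact
                        // deduplication this requires a second merge step per hot key,
                        // since duplicates may now land in different partitions.
                        int salt = ThreadLocalRandom.current().nextInt(4);
                        return Math.floorMod(key.hashCode() + salt, numPartitions);
                    }
                    // Regular keys: plain hash partitioning.
                    return Math.floorMod(key.hashCode(), numPartitions);
                },
                Event::getKey);
    }

    public static class Event {
        private String key;
        public String getKey() { return key; }
    }
}
```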

Re: [ANNOUNCE] Apache Flink 1.19.0 released

2024-03-18 Thread Jeyhun Karimov
Congrats! Thanks to the release managers and everyone involved. Regards, Jeyhun On Mon, Mar 18, 2024 at 9:25 AM Lincoln Lee wrote: > The Apache Flink community is very happy to announce the release of Apache > Flink 1.19.0, which is the first release for the Apache Flink 1.19 series. > > Apache Fli

Re: Inquiry Regarding Flink Tumbling Window Persistence and Restart Handling for File Source

2023-12-04 Thread Jeyhun Karimov
Hi Arjun, Thanks for your query. Flink is fault tolerant and supports exactly-once semantics. In your case, the aggregated values can be recovered after a failure or an application restart; you just need to enable checkpointing and configure an appropriate state backend. Regards, Jeyhun > > O
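A minimal sketch of what the reply describes, assuming a recent Flink version; the checkpoint path and job details are placeholders, while `enableCheckpointing`, `setStateBackend`, and `setCheckpointStorage` are real Flink APIs.

```java
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 s with exactly-once state guarantees.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        // Keep operator state on the heap; checkpoints go to durable storage
        // so window aggregates survive failures and restarts (path is a placeholder).
        env.setStateBackend(new HashMapStateBackend());
        env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");

        // ... define the file source, keyBy, tumbling window, and aggregation here ...

        env.execute("checkpointed-tumbling-window");
    }
}
```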

Re: Mixing Batch & Streaming

2016-03-04 Thread Jeyhun Karimov
Hi all, We are currently working on this issue to enable efficient mixing between DataStream windows and DataSets. For now, the simplest solution would be to output each window to a sequential file in HDFS and run the computation on that data source as a DataSet. On Fri, Mar 4, 2016 at 4:05 PM sskhiri w
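A rough sketch of that workaround, not from the original thread: materialize each window's result to files, then read them back with the batch API. The source, aggregation, and paths are placeholders, and in practice the two stages would run as separate jobs.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowToDataSet {
    public static void main(String[] args) throws Exception {
        // Streaming stage: aggregate per window and write results to HDFS.
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> windowed = streamEnv
                .fromElements("a", "b", "a")                         // placeholder source
                .keyBy(line -> line)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(10)))
                .reduce((a, b) -> a);                                // placeholder aggregation
        windowed.writeAsText("hdfs:///tmp/window-output");           // placeholder path
        streamEnv.execute("window-to-files");

        // Batch stage: read the materialized window output back as a DataSet.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> asDataSet = batchEnv.readTextFile("hdfs:///tmp/window-output");
        asDataSet.first(10).print();
    }
}
```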

Re: Stream conversion

2016-02-04 Thread Jeyhun Karimov
6 at 10:30 AM, Sane Lee wrote: > I also have a similar scenario. Any suggestion would be appreciated. > On Thu, Feb 4, 2016 at 10:29 AM Jeyhun Karimov wrote: > Hi Matthias, > This need no

Re: Stream conversion

2016-02-04 Thread Jeyhun Karimov
Hi Matthias, This need not necessarily be in API functions. I just want a roadmap for adding this functionality. Should I save each window's data to disk and create a new DataSet environment in parallel? Or maybe change the trigger functionality? I have large windows. As I asked in previous q