One more piece of context in case it's relevant: these jobs originally ran with taskmanager.numberOfTaskSlots set to 1 and were only later changed to 2, so the change itself may be a factor in the issue.
On Mon, Dec 9, 2024 at 8:13 AM Rion Williams <rionmons...@gmail.com> wrote:

> Hi all,
>
> In trying to optimize the performance of some of the existing Flink jobs
> that are running in production environments, I've recently done some
> experimenting with taking advantage of the taskmanager.numberOfTaskSlots
> configuration for some of my Flink jobs and noticed an issue.
>
> It appears when this configuration is set to a value greater than 1, the
> Kafka-based producers will fail with the following error which appears to
> be directly related to the configuration changes themselves:
>
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000ms while awaiting InitProducerId
>
> I searched through the existing Apache JIRA project to try and identify a
> similar documented issue, however I didn't find anything that directly
> pointed to it. Is this a potential issue/bug or is this the expected
> behavior (i.e. numberOfTaskSlots must be 1 to support this).
>
> Any advice would be appreciated!
>
> Thanks,
>
> Rion