I know this seems like a silly question, but I am trying to figure out the optimal setup for our Flink jobs. We are using a standalone cluster with 5 jobs. Each job has 3 async operators, each with its own Executor, with thread counts of 20, 20 and 100. The source is Kafka, and there are Cassandra and REST sinks. Currently we are using parallelism = 1, so at max load a single job spawns at least 140 threads from these executors alone. We are also using Netty-based libraries for Cassandra and the REST calls (and, as far as I can see in the thread dump, Flink itself also uses a Netty server). What we see is that the total thread count adds up to ~500 for a single job.
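To make the arithmetic concrete, here is a rough sketch of the thread budget. It is not anything Flink provides: the class and method names are made up for illustration, and the numbers are the ones from this post (executor sizes of 20, 20 and 100, about 500 observed threads per job, 5 jobs, and the per-user process limit of about 1500 on the affected server). It assumes the worst case, where every job lands on the same server.

```java
// Hypothetical helper, for illustration only; not part of Flink.
final class ThreadBudget {
    private ThreadBudget() {}

    /** Sum of the maximum pool sizes of a job's async-operator executors. */
    static int executorThreads(int... poolSizes) {
        int total = 0;
        for (int size : poolSizes) {
            total += size;
        }
        return total;
    }

    /** Worst case: every job is scheduled onto the same server. */
    static int worstCasePerServer(int jobs, int threadsPerJob) {
        return jobs * threadsPerJob;
    }

    public static void main(String[] args) {
        int fromExecutors = executorThreads(20, 20, 100); // async operators only
        int observedPerJob = 500;     // approximate total seen in the thread dump
        int jobs = 5;
        int userProcessLimit = 1500;  // approximate limit on the affected server

        int worstCase = worstCasePerServer(jobs, observedPerJob);
        System.out.println("Executor threads per job: " + fromExecutors);
        System.out.println("Worst case on one server: " + worstCase);
        System.out.println("Exceeds limit: " + (worstCase > userProcessLimit));
    }
}
```

With these figures the executors account for 140 threads per job, and 5 jobs at roughly 500 threads each come to about 2500 threads on one server, well above a limit of 1500. On Linux the per-user process limit counts threads as well as processes, which is why thread-heavy jobs can hit it.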
Suddenly all jobs began to fail in production, and we saw that it was mainly due to the ulimit on user processes. All jobs had been started on one server in the cluster (I do not know why, as the cluster has 3 members), and the limit was set to around 1500 on that server. We then set a higher value, and the problems seem to have gone away. Can you recommend an optimal production setting for a standalone cluster? Or should there be a maximum limit on the threads spawned by a single job?

Regards