Yes, with smaller data, and therefore smaller resources and parallelism, the exact same job runs fine.
On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek <aljos...@apache.org> wrote:

> Hi,
>
> So with Flink 1.5.3 but a smaller parallelism the job works fine?
>
> Best,
> Aljoscha
>
> > On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
> >
> > Hello,
> >
> > I am trying to get my Beam application to run on a newer version of
> > Flink (1.5.3) but am having trouble with that. When I submit the
> > application, everything works fine, but after a few minutes (as soon as
> > 2 minutes after job start) the cluster just goes bad. Logs are full of
> > timeouts for heartbeats, JobManager lost leadership, TaskExecutor timed
> > out, etc.
> >
> > At that time, the WebUI is also unusable. Looking into the job manager,
> > I noticed that all of the "flink-akka.actor.default-dispatcher" threads
> > are busy or blocked. Most blocks are on metrics:
> >
> > =======================================
> > java.lang.Thread.State: BLOCKED (on object monitor)
> >     at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
> >     - waiting to lock <0x000000053df75510> (a org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
> >     at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
> >     at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown Source)
> >     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> >     ...
> > =======================================
> >
> > I tried to increase memory, as the MetricStore seems to hold quite a
> > lot of stuff, but it is not helping. On 1.4.0 the job manager was
> > running with a 4GB heap; now this behaviour also occurs with 10GB.
> >
> > Any suggestions?
> >
> > Best,
> > Jozef
> >
> > P.S.: The Beam app hits this problem in a setup with 100 parallelism,
> > 100 task slots, 2100 running tasks, streaming mode. A smaller job runs
> > without problem.
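For what it's worth, since the symptoms point at heartbeat timeouts plus lock contention in the MetricStore fed by the WebUI's MetricFetcher, one thing worth experimenting with is relaxing the relevant timeouts and slowing down the WebUI refresh, so the dispatcher threads are polled less aggressively while the JobManager is under load. A hedged flink-conf.yaml sketch (the keys are standard Flink options; the specific values are guesses, not tested recommendations for this job):

```yaml
# Sketch only - tune values for your cluster, these are assumptions.

# Give heartbeats more slack so a briefly blocked dispatcher thread
# does not cause "TaskExecutor timed out" / lost-leadership churn.
heartbeat.timeout: 120000        # ms, default is 50000
akka.ask.timeout: 60 s           # RPC timeout, default is 10 s

# Slow down the WebUI refresh; each refresh triggers the MetricFetcher
# that is contending on the MetricStore lock in the stack trace above.
web.refresh-interval: 10000      # ms, default is 3000

# JobManager heap (1.5.x still uses the .mb key); the thread dump
# suggests contention rather than memory pressure, so this alone
# likely won't help - as observed going from 4GB to 10GB.
jobmanager.heap.mb: 10240
```

Keeping the WebUI closed while the job stabilizes is a cheap way to test whether the metric polling is actually what ties up the "flink-akka.actor.default-dispatcher" threads.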