parallelism is 100. I tried clusters with 1 and 2 slots per TM, yielding 100 or 50 TMs in the cluster.
I did notice that the URL http://jobmanager:port/jobs/job_id/metrics in 1.5.x returns a huge list of "latency.source_id. ...." IDs. A heap dump shows that the hash map takes 1.6 GB for me. I am guessing that is the one the dispatcher threads keep updating. I am not sure what those are (a sketch of how latency tracking could be switched off is below the quoted thread). In 1.4.0 that URL returns something else, a very short list.

On Thu, Aug 23, 2018 at 6:44 PM Piotr Nowojski <pi...@data-artisans.com> wrote:

> Hi,
>
> How many task slots do you have in the cluster and per machine, and what
> parallelism are you using?
>
> Piotrek
>
> > On 23 Aug 2018, at 16:21, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
> >
> > Yes, on smaller data and therefore smaller resources and parallelism
> > exactly the same job runs fine
> >
> > On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek <aljos...@apache.org> wrote:
> >
> >> Hi,
> >>
> >> So with Flink 1.5.3 but a smaller parallelism the job works fine?
> >>
> >> Best,
> >> Aljoscha
> >>
> >>> On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I am trying to run my Beam application on a newer version of Flink
> >>> (1.5.3) but am having trouble with that. When I submit the application,
> >>> everything works fine, but after a few minutes (as soon as 2 minutes
> >>> after job start) the cluster just goes bad. Logs are full of heartbeat
> >>> timeouts, JobManager lost leadership, TaskExecutor timed out, etc.
> >>>
> >>> At that time, the WebUI is also not usable. Looking into the job
> >>> manager, I noticed that all of the "flink-akka.actor.default-dispatcher"
> >>> threads are busy or blocked. Most are blocked on metrics:
> >>>
> >>> =======================================
> >>> java.lang.Thread.State: BLOCKED (on object monitor)
> >>> at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
> >>> - waiting to lock <0x000000053df75510> (a org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
> >>> at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
> >>> at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown Source)
> >>> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> >>> ...
> >>> =======================================
> >>>
> >>> I tried to increase memory, as the MetricStore seems to hold quite a lot
> >>> of stuff, but it is not helping. On 1.4.0 the job manager was running
> >>> with a 4 GB heap; now this behaviour also occurs with a 10 GB heap.
> >>>
> >>> Any suggestions?
> >>>
> >>> Best,
> >>> Jozef
> >>>
> >>> P.S.: The executed Beam app has this problem in a setup with 100
> >>> parallelism, 100 task slots, 2100 running tasks, streaming mode. A
> >>> smaller job runs without problem
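
Regarding the "latency.source_id. ..." entries above: if they are Flink's built-in latency markers, one way to rule them out would be to switch latency tracking off and see whether the metric list (and the MetricStore) shrinks back to the 1.4.0 behaviour. Below is a minimal sketch for a plain Flink DataStream job; it assumes the job can set its ExecutionConfig directly (for a Beam pipeline, the Flink runner's pipeline options would be the place to look, which I have not checked).

=======================================
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DisableLatencyTracking {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Latency markers are emitted per source subtask and tracked per
        // downstream operator subtask, so the number of
        // "latency.<source_id>..." metrics grows quickly with parallelism.
        // A non-positive interval should disable latency tracking
        // (worth double-checking against the 1.5.x docs).
        env.getConfig().setLatencyTrackingInterval(0L);

        // ... build the topology and run it as usual ...
        // env.execute("my-job");
    }
}
=======================================

If disabling it makes the /jobs/job_id/metrics list shrink, that would at least confirm where the 1.6 GB in the MetricStore is going.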