Hi Mason,

Can you use the JVM CPU performance analysis tools, e.g. JProfiler and Arthas (https://github.com/alibaba/arthas)? You can probably guess the reason for the high CPU load.
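If attaching JProfiler or Arthas to the TaskManager is not convenient, even the plain JDK ThreadMXBean gives a first picture of how many threads are alive and which ones burn the most CPU. A rough sketch of a helper you could call (or log periodically) from inside the job; it only sees the JVM it runs in, so it has to run inside the TaskManager process:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

public final class ThreadCpuDump {

    /** Prints the live-thread count and the topN threads with the most accumulated CPU time. */
    public static void dumpBusiestThreads(int topN) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (mx.isThreadCpuTimeSupported()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        System.out.println("live threads: " + mx.getThreadCount());

        // Sort thread ids by accumulated CPU time, descending, and print the busiest ones.
        Arrays.stream(mx.getAllThreadIds()).boxed()
                .sorted((a, b) -> Long.compare(mx.getThreadCpuTime(b), mx.getThreadCpuTime(a)))
                .limit(topN)
                .forEach(id -> {
                    ThreadInfo info = mx.getThreadInfo(id);
                    if (info != null) {
                        System.out.printf("%-60s cpu=%d ms%n",
                                info.getThreadName(), mx.getThreadCpuTime(id) / 1_000_000);
                    }
                });
    }
}

(If I remember correctly, Arthas's thread command gives roughly the same view without any code changes.)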

Jake

> On Aug 6, 2020, at 12:25 PM, Chen, Mason <mason.c...@sony.com> wrote:
>
> Thanks Peter for the reply. I noticed the behavior you described when I reduced the parallelism of the asyncio sink to 8: one task manager had its slots completely taken and the other one had all of its slots completely open. To mitigate this, I tried the setting `cluster.evenly-spread-out-slots: true`, but it didn't fix anything (I had expected the job manager to split the task slot requirements evenly between the two task managers). It seems that, in general, I should be extremely wary of the parallelism and the number of task slots, and of their effects on CPU/memory usage.
>
> I will use your workaround of a parallelism of 8; I can scale the capacity of the asyncio accordingly, no problem there. For the filter function, I kept it at 4 since there's a cache involved and I noticed that the hit rate was worse when the parallelism was higher. I will use a keyBy to mitigate this.
>
> From: Piotr Nowojski <piotr.nowoj...@gmail.com>
> Date: Wednesday, August 5, 2020 at 10:36 AM
> To: "Chen, Mason" <mason.c...@sony.com>
> Cc: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Only One TaskManager Showing High CPU Usage
>
> Hi,
>
> What I guess is happening is that, since you have 16 slots in total (8 slots per TM) while your operators have various levels of parallelism (8, 4, 16), Flink is scheduling all of the operators with parallelism < 16 on the TM that becomes available to the scheduler first. That is what is causing the visible load skew. Keep in mind that, by default, different operators are allowed to share the same task slot unless you explicitly tell them not to [1].
>
> One obvious workaround would be to define the same parallelism for all of the operators, and that's the usual way to go unless you have a really good reason not to. Can you try this out? Usually there is no harm in keeping more operator instances than required, and in your case you already have the highest parallelism in your async function (the one that allocates the most resources?).
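For example, Piotrek's suggestion of one uniform parallelism could look roughly like this. A minimal sketch only; the source, filter, and async function below are stand-ins for the real Kafka consumer, cache-backed filter, and HTTP call, and 16 simply matches the highest parallelism already in use:

import java.util.Collections;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.ResultFuture;

public class UniformParallelismJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(16); // one default for every operator, no per-operator overrides

        // Placeholder source instead of the real Kafka consumer.
        DataStream<String> events = env.fromElements("a", "b", "c");

        DataStream<String> filtered = events
                .filter(value -> !value.isEmpty()) // stand-in for the real filter; no setParallelism(4) anymore
                .name("filter");

        AsyncDataStream
                .unorderedWait(filtered, new NoopAsyncFunction(), 30, TimeUnit.SECONDS, 100)
                .name("async-http"); // inherits the env-wide parallelism of 16

        env.execute("uniform-parallelism-job");
    }

    /** Stand-in for the real HTTP async function. */
    private static class NoopAsyncFunction implements AsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
            resultFuture.complete(Collections.singleton(input));
        }
    }
}

With every operator at the same parallelism, the scheduler has no "small" vertices left to pack onto whichever TM registers first, and slot sharing spreads the work across both TaskManagers. If some operator really must keep a lower parallelism, giving it its own slot sharing group ([1] below) is the other lever.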
>
> Till, is there a way to change this resource allocation/scheduling behaviour, so that not everything gets packed onto the same TM?
>
> Piotrek
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/operators/#task-chaining-and-resource-groups
>
> On Wed, Aug 5, 2020 at 02:39, Chen, Mason <mason.c...@sony.com> wrote:
> Hi all,
>
> The issue is that only one out of two task managers experiences high CPU usage. [attachment: image001.png]
>
> I'm running a series of performance tests processing records at 50k rps. In this setup, I have 1 job manager (1 core, 1 GB) and 2 task managers (8 cores, 8 GB). Each of the task managers has 8 task slots, and we have a simple pipeline that reads from Kafka, filters, and makes an HTTP request downstream with the asyncio function.
>
> All operators have a parallelism of 8, except the filter (parallelism of 4) and the asyncio function (parallelism of 16). We do not have checkpointing turned on.
>
> I thought maybe operator chaining was causing issues in distributing the load, so I disabled operator chaining after the filter (before the asyncio). However, the issue still persisted, and I saw a somewhat even distribution of records both before and after this change.
>
> Some potential problems: the HTTP client is not static, so it will be recreated for each parallel instance of the asyncio operator (so there are going to be a lot of executors). At the CPU peak I see 10k threads, and the count steadily grows to 40k by the end of the time period shown.
>
> Does anyone have any ideas? Of the 50k rps, only about 500 of those events need to hit the asyncio function (the filter filters out the unrelated events). I was doing fine before I added the unrelated events (just the 500 rps going to asyncio).
>
> Thanks,
> Mason
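Regarding the non-static HTTP client you mention: building a separate client and executor for every parallel async instance (or, worse, per request) is consistent with that kind of thread growth. Below is a minimal sketch of one common fix, sharing a single client with a bounded executor across all the subtasks in one TaskManager JVM. The class name, the endpoint, and the JDK 11 HttpClient are illustrative assumptions, not the code from your job:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

/**
 * Async HTTP lookup that reuses one HTTP client (and one small, bounded executor)
 * for every subtask running in the same TaskManager JVM, instead of building a
 * new client per parallel instance.
 */
public class SharedClientAsyncFunction extends RichAsyncFunction<String, String> {

    // One client and executor per JVM, shared by all parallel subtasks in this TM.
    // (Shutting the shared executor down on job termination is left out of this sketch.)
    private static HttpClient sharedClient;
    private static ExecutorService sharedExecutor;

    private transient HttpClient client;

    @Override
    public void open(Configuration parameters) {
        synchronized (SharedClientAsyncFunction.class) {
            if (sharedClient == null) {
                sharedExecutor = Executors.newFixedThreadPool(8); // bounded, instead of one pool per subtask
                sharedClient = HttpClient.newBuilder().executor(sharedExecutor).build();
            }
            client = sharedClient;
        }
    }

    @Override
    public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
        // Placeholder endpoint; substitute the real downstream service.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://example.com/lookup?key=" + key)).build();
        client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .whenComplete((response, error) -> {
                    if (error != null) {
                        resultFuture.completeExceptionally(error);
                    } else {
                        resultFuture.complete(Collections.singleton(response.body()));
                    }
                });
    }
}

If you would rather keep one client per subtask, creating it in open() rather than per record, giving it a bounded executor, and shutting that executor down in close() gets you most of the way there; the important part is that the thread count stays fixed instead of growing with traffic and time.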