Hello,
We ran into the same problem. We've tried most of the same steps and made
the same observations:
- Increased memory
- Increased CPU
- No noticeable increase in GC activity
- Little network I/O
Our current setup has the liveness probe disabled and we've increased the
Akka timeouts; this seems to help.
Hi Prakhar,
Sorry, I don't have much experience with k8s. Maybe someone else can help.
On Fri, Jul 26, 2019 at 6:20 PM Prakhar Mathur wrote:
> Hi,
>
> So we were deploying our flink clusters on YARN earlier but then we moved
> to kubernetes, but then our clusters were not this big. Have you g
Hi,
We were deploying our Flink clusters on YARN earlier and then moved to
Kubernetes, but back then our clusters were not this big. Have you seen
issues with the JobManager REST server becoming unresponsive on Kubernetes
before?
On Fri, Jul 26, 2019, 14:28 Biao Liu wrote:
> Hi Prakhar,
>
>
Hi Prakhar,
Sorry, I could not find anything abnormal in your GC log or stack trace.
Have you tried deploying the cluster some other way, not on Kubernetes, for
example on YARN or standalone? That would help narrow down the scope.
On Tue, Jul 23, 2019 at 12:34 PM Prakhar Mathur
wrote:
>
> On Mon
On Mon, Jul 22, 2019, 16:08 Prakhar Mathur wrote:
> Hi,
>
> We enabled GC logging, here are the logs
>
> [GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)]
> 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00,
> real=0.06 secs]
> [GC (Allocation Failure) [PSYo
Hi,
Your GC metrics look fine. You could double-check with the GC log if you
enable it; the GC log is more direct.
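
As a rough sketch of what "checking the GC log" could look like, the snippet
below summarises a single ParallelGC entry in the format quoted elsewhere in
this thread. The function name summarize_gc_line and the returned field names
are made up for illustration, and the pattern assumes the wrapped log lines
have been joined back into one line with the "> " quote prefixes removed.

```python
import re

# Matches one ParallelGC young-collection entry such as:
# [GC (Allocation Failure) [PSYoungGen: 1K->2K(3K)] 4K->5K(6K), 0.01 secs]
GC_LINE = re.compile(
    r"\[PSYoungGen: (?P<young_before>\d+)K->(?P<young_after>\d+)K"
    r"\((?P<young_cap>\d+)K\)\]\s+"
    r"(?P<heap_before>\d+)K->(?P<heap_after>\d+)K\((?P<heap_cap>\d+)K\),\s+"
    r"(?P<pause>[\d.]+) secs\]"
)


def summarize_gc_line(line):
    """Return pause time and heap usage for one GC entry, or None if no match."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    heap_before = int(match.group("heap_before"))
    heap_after = int(match.group("heap_after"))
    heap_cap = int(match.group("heap_cap"))
    return {
        # Stop-the-world pause of this collection, in milliseconds.
        "pause_ms": float(match.group("pause")) * 1000.0,
        # Share of the total heap still in use after the collection.
        "heap_used_after_pct": 100.0 * heap_after / heap_cap,
        # Amount of heap freed by this collection, in KB.
        "reclaimed_kb": heap_before - heap_after,
    }
```

For the entry quoted in this thread, this gives a pause of about 59 ms and a
heap that is under 3% full after the collection, which is consistent with GC
not being the bottleneck.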
I'm not sure what's happening in your JobManager, but I'm pretty sure that
Flink can support far larger clusters than yours.
Have you ever checked the log f
Hi Prakhar,
Have you checked garbage collection on the master?
Which version of Flink are you using? How many TaskManagers are in your
cluster?
On Thu, Jul 18, 2019 at 1:54 PM Prakhar Mathur wrote:
> Hello,
>
> We have deployed multiple Flink clusters on Kubernetes with 1 replica of
> Jobmanager and multip
Hello,
We have deployed multiple Flink clusters on Kubernetes with 1 replica of
the JobManager and multiple TaskManagers as required. Recently we have
observed that when we increase the number of TaskManagers for a cluster,
the JobManager becomes unresponsive. It stops sending statsd metric