[Cloud] [Cloud-announce] [Incident] Toolforge: Ongoing intermittent DNS resolution issues

2024-11-26 Thread Slavina Stefanova
Hello,

We are currently investigating widespread intermittent DNS resolution
issues within the Toolforge Kubernetes cluster that began on Sunday. These
issues are causing some jobs and deployments to fail, particularly on NFS
worker nodes.

Impact:

   - Some tools may experience failed deployments or crashes
   - Job execution may be inconsistent
   - Image pulls may fail intermittently

Our team is actively investigating and working to resolve the issue. We
will send an update once we have more information or when the incident is
resolved.

Thank you for your patience,

Slavina, on behalf of the WMCS Team
--
Slavina Stefanova (she/her)
Software Engineer | Cloud Services

Wikimedia Foundation
___
Cloud-announce mailing list -- cloud-annou...@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/
___
Cloud mailing list -- cloud@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/


[Cloud] [Cloud-announce] Re: [Incident] Toolforge: Ongoing intermittent DNS resolution issues

2024-11-26 Thread Slavina Stefanova
Hello everyone,

Following up on this incident: the situation has stabilized since the
control plane nodes were rebooted at around 10:25 UTC.

*Current status:*

   - No new DNS-related failures have been observed since the control plane
   reboots
   - Tool deployments and jobs are running normally

While we're still observing some underlying networking warnings, these are
not currently impacting service. We will continue monitoring the situation
and investigating the root cause to prevent future occurrences.

If you notice any DNS-related issues, please report them in the Phabricator
task: https://phabricator.wikimedia.org/T380844
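
When reporting, it may help to attach evidence of how often lookups fail.
The following is a minimal sketch, not an official WMCS tool: the function
name `probe_dns`, its parameters, and the idea of running it from inside a
tool's job are all assumptions made for illustration. It resolves one
hostname repeatedly with the Python standard library and records which
attempts failed.

```python
import socket
import time
from datetime import datetime, timezone


def probe_dns(hostname, attempts=20, delay=0.5, resolver=socket.getaddrinfo):
    """Resolve `hostname` repeatedly and collect the failed attempts.

    `probe_dns` is a hypothetical helper for illustration only.
    `resolver` defaults to socket.getaddrinfo and can be swapped out
    for testing.
    """
    failures = []
    for i in range(attempts):
        stamp = datetime.now(timezone.utc).isoformat()
        try:
            resolver(hostname, None)
        except socket.gaierror as exc:
            # gaierror is what getaddrinfo raises when resolution fails
            failures.append(f"attempt {i + 1} at {stamp}: {exc}")
        # Pause between attempts so intermittent failures have a chance to show
        if delay and i < attempts - 1:
            time.sleep(delay)
    return {"hostname": hostname, "attempts": attempts, "failures": failures}
```

Calling, for example, `probe_dns("wikimedia.org")` (the hostname is only a
placeholder) and pasting the returned `failures` list into the task would
show whether failures are still occurring and at what times.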

Thank you for your patience during this incident.

Cheers,
WMCS Team
--
Slavina Stefanova (she/her)
Software Engineer | Cloud Services

Wikimedia Foundation


On Tue, Nov 26, 2024 at 10:30 AM Slavina Stefanova wrote:

> Hello,
>
> We are currently investigating widespread intermittent DNS resolution
> issues within the Toolforge Kubernetes cluster that began on Sunday. These
> issues are causing some jobs and deployments to fail, particularly on NFS
> worker nodes.
>
> Impact:
>
>    - Some tools may experience failed deployments or crashes
>    - Job execution may be inconsistent
>    - Image pulls may fail intermittently
>
> Our team is actively investigating and working to resolve the issue. We
> will send an update once we have more information or when the incident is
> resolved.
>
> Thank you for your patience,
>
> Slavina, on behalf of the WMCS Team
> --
> Slavina Stefanova (she/her)
> Software Engineer | Cloud Services
>
> Wikimedia Foundation
>