Have you tried logging in to the pod to see whether you can tell anything
about what's going on? The process is partly described here:

https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes

1. Find the name of your pod:
   kubectl get pods

2. Then log in to it:
   kubectl exec -it <podname> -- /bin/bash
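
Since the failure seems to leave nothing in the logs, it might also help to
capture the pod's state at the moment the 503s start. Below is a rough bash
sketch of that idea, not a tested recipe. The function names `is_failure` and
`watch_pod` are made up for illustration, and it assumes `curl` and `kubectl`
are available where you run it and that `ps` exists in the pod image.

```shell
# Hypothetical helper: succeed (exit 0) if the HTTP status code is 5xx.
is_failure() {
    case "$1" in
        5??) return 0 ;;   # 500-599, e.g. the 503 described in the report
        *)   return 1 ;;   # anything else, including curl's "000" on no connection
    esac
}

# Hypothetical watcher: poll the tool once a minute and, on the first 5xx,
# dump what Kubernetes and the pod itself can tell us, then stop.
# Usage: watch_pod <podname>
watch_pod() {
    local pod="$1"
    local url="https://tools.wmflabs.org/dplbot/"
    local code
    while true; do
        # -w '%{http_code}' prints only the status code; the body is discarded.
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
        if is_failure "$code"; then
            echo "$(date -u +%FT%TZ) got HTTP $code, capturing pod state"
            kubectl describe pod "$pod"        # events, restarts, resource limits
            kubectl logs "$pod" --tail=100     # last lines of container output
            kubectl exec "$pod" -- ps aux      # processes still alive in the pod
            break
        fi
        sleep 60
    done
}
```

After restarting the webservice, you would get the new pod name from
`kubectl get pods` and run `watch_pod <podname>`, redirecting the output to a
file. The timestamp of the first 5xx can then be compared against error.log
and access.log.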

   Arthur

On Sun, Jan 26, 2020 at 10:57 AM Russell Blau <russb...@imapmail.org> wrote:

> TL;DR: The lighttpd webservice for https://tools.wmflabs.org/dplbot/
> fails repeatedly, frequently, and unpredictably, and I have been unable to
> diagnose any cause.
>
> Currently, tools.dplbot is running a php7.2 webservice on the kubernetes
> backend; however, the failures started occurring when it was running
> lighttpd on the job grid, and the move to kubernetes does not seem to have
> changed anything in this respect.  The tool serves a variety of PHP-based
> pages which generate reports from the Toolforge database replicas.
>
> The symptom of failure is that all requests get rejected with 503 service
> unavailable.  The lighttpd process continues to run (which is why I am
> calling this a "failure" rather than a "crash"), which means kubernetes
> doesn't detect any problem and doesn't restart the server, but the server
> does not respond to any requests. The "webservice status" command claims
> that the webservice is still running. Every time this happens, I have to
> restart the webservice.  The webservice appears to fail immediately after
> some restarts, while in other cases it runs normally for a period of time,
> which is highly variable (minutes to hours) and then fails again.
>
> Even more frustrating than the constant failures is the lack of any
> information to allow diagnosing the cause of this.  The error.log file
> (/data/project/dplbot/error.log) does not show any error messages
> corresponding to the times of failures. I tried various lighttpd debugging
> options, and none of these gave me anything useful. They appear to show all
> requests being handled normally, and no debug information at all at or
> after the point of failure. I also reactivated access logging
> (/data/project/dplbot/access.log), and this only shows requests that were
> handled correctly. In other words, there is no log indicating a request
> that came in at/just before a failure without a corresponding response
> going out.
>
> If these failures were being caused spontaneously by some problem in
> lighttpd or in the Toolforge infrastructure, I would expect other users to
> be affected by them, but that doesn't seem to be the case.
>
> This has previously been reported at
> https://phabricator.wikimedia.org/T115231 (including more detail on the
> debug options I tried), where frankly I have received absolutely no
> assistance.  I did receive one mildly helpful comment from bd808 on a
> related issue (https://phabricator.wikimedia.org/T218915), as follows:
>
> ... [It is] possible to have a Kubernetes powered webservice become
> unresponsive to client requests due to an internal deadlock or resource
> exhaustion issue in the application which does not also lead to a crash of
> the lighttpd process itself.
>
>
> However, if there is an internal deadlock or resource exhaustion issue in
> the underlying PHP scripts, I would expect some error message in the logs,
> which isn't there.  Also, during a recent interval when the server was up
> for a while, I took the time to click every single link on
> https://tools.wmflabs.org/dplbot/, and the server responded to every one
> of them, so there does not seem to be a fatal bug in any of the scripts
> (although this exercise revealed a few minor issues).
>
> I'm not necessarily looking for someone to solve this problem for me
> (although that would be nice :-) ), but just some ideas about how to
> identify potential causes. Right now it is basically a black hole; no
> information whatsoever is coming out of the webserver at the point of
> failure, so I can make no progress.
>
> --
>   Russell Blau
>   russb...@imapmail.org
> _______________________________________________
> Wikimedia Cloud Services mailing list
> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
> https://lists.wikimedia.org/mailman/listinfo/cloud
