Have you tried logging in to the pod to see if you can tell anything about
what's going on? The process is partly described here:
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes

That is, first find the name of your "pod":

1. kubectl get pods

Then log in to it:

2. kubectl exec -it <podname> -- /bin/bash

Arthur

On Sun, Jan 26, 2020 at 10:57 AM Russell Blau <russb...@imapmail.org> wrote:

> TL;DR: The lighttpd webservice for https://tools.wmflabs.org/dplbot/
> fails repeatedly, frequently, and unpredictably, and I have been unable to
> diagnose any cause.
>
> Currently, tools.dplbot is running a php7.2 webservice on the kubernetes
> backend; however, the failures started occurring when it was running
> lighttpd on the job grid, and the move to kubernetes does not seem to have
> changed anything in this respect. The tool serves a variety of PHP-based
> pages which generate reports from the Toolforge database replicas.
>
> The symptom of failure is that all requests get rejected with 503 service
> unavailable. The lighttpd process continues to run (which is why I am
> calling this a "failure" rather than a "crash"), which means kubernetes
> doesn't detect any problem and doesn't restart the server, but the server
> does not respond to any requests. The "webservice status" command claims
> that the webservice is still running. Every time this happens, I have to
> restart the webservice. The webservice appears to fail immediately after
> some restarts, while in other cases it runs normally for a period of time,
> which is highly variable (minutes to hours) and then fails again.
>
> Even more frustrating than the constant failures is the lack of any
> information to allow diagnosing the cause of this. The error.log file
> (/data/project/dplbot/error.log) does not show any error messages
> corresponding to the times of failures. I tried various lighttpd debugging
> options, and none of these gave me anything useful. They appear to show all
> requests being handled normally, and no debug information at all at or
> after the point of failure.
> I also reactivated access logging
> (/data/project/dplbot/access.log), and this only shows requests that were
> handled correctly. In other words, there is no log indicating a request
> that came in at/just before a failure without a corresponding response
> going out.
>
> If these failures were being caused spontaneously by some problem in
> lighttpd or in the Toolforge infrastructure, I would expect other users to
> be affected by them, but that doesn't seem to be the case.
>
> This has previously been reported at
> https://phabricator.wikimedia.org/T115231 (including more detail on the
> debug options I tried), where frankly I have received absolutely no
> assistance. I did receive one mildly helpful comment from bd808 on a
> related issue (https://phabricator.wikimedia.org/T218915), as follows:
>
> ... [It is] possible to have a Kubernetes powered webservice become
> unresponsive to client requests due to an internal deadlock or resource
> exhaustion issue in the application which does not also lead to a crash of
> the lighttpd process itself.
>
> However, if there is an internal deadlock or resource exhaustion issue in
> the underlying PHP scripts, I would expect some error message in the logs,
> which isn't there. Also, during a recent interval when the server was up
> for a while, I took the time to click every single link on
> https://tools.wmflabs.org/dplbot/, and the server responded to every one
> of them, so there does not seem to be a fatal bug in any of the scripts
> (although this exercise revealed a few minor issues).
>
> I'm not necessarily looking for someone to solve this problem for me
> (although that would be nice :-) ), but just some ideas about how to
> identify potential causes. Right now it is basically a black hole; no
> information whatsoever is coming out of the webserver at the point of
> failure, so I can make no progress.
>
> --
> Russell Blau
> russb...@imapmail.org
> _______________________________________________
> Wikimedia Cloud Services mailing list
> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
> https://lists.wikimedia.org/mailman/listinfo/cloud