TL;DR: The lighttpd webservice for https://tools.wmflabs.org/dplbot/ fails 
repeatedly, frequently, and unpredictably, and I have been unable to diagnose 
any cause.

Currently, tools.dplbot is running a php7.2 webservice on the Kubernetes 
backend; however, the failures started occurring when it was running lighttpd 
on the job grid, and the move to Kubernetes does not seem to have changed 
anything in this respect. The tool serves a variety of PHP-based pages which 
generate reports from the Toolforge database replicas.

The symptom of failure is that all requests get rejected with 503 Service 
Unavailable. The lighttpd process continues to run (which is why I am calling 
this a "failure" rather than a "crash"), which means Kubernetes doesn't detect 
any problem and doesn't restart the server, but the server does not respond to 
any requests. The "webservice status" command claims that the webservice is 
still running. Every time this happens, I have to restart the webservice. The 
webservice appears to fail immediately after some restarts, while in other 
cases it runs normally for a period of time, which is highly variable (minutes 
to hours) and then fails again.
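
Since the failure times are so variable, an external probe could at least 
record exactly when the service flips from working to 503. The following is a 
minimal sketch, not something that is already running for dplbot: the function 
names, the 30-second interval, and the 10-second timeout are all illustrative 
assumptions, and only the URL comes from this report.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

URL = "https://tools.wmflabs.org/dplbot/"  # the tool's front page, as given above


def check_once(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code, or None if no HTTP response arrived."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 503 are raised as HTTPError; keep the code.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def watch(url, interval=30.0, rounds=None, check=check_once,
          sleep=time.sleep, log=print):
    """Poll `url` and log a UTC-stamped line whenever the status changes.

    `rounds=None` polls forever; a number limits the polls (handy for tests).
    """
    last = object()  # sentinel, so the first result is always logged
    done = 0
    while rounds is None or done < rounds:
        status = check(url)
        if status != last:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            log(f"{stamp} status changed to {status}")
            last = status
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
```

Calling `watch(URL)` from any machine outside the pod would print one line per 
transition, which gives timestamps to compare against error.log and access.log.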

Even more frustrating than the constant failures is the lack of any information 
to allow diagnosing the cause of this. The error.log file 
(/data/project/dplbot/error.log) does not show any error messages corresponding 
to the times of failures. I tried various lighttpd debugging options, and none 
of these gave me anything useful. They appear to show all requests being 
handled normally, and no debug information at all at or after the point of 
failure. I also reactivated access logging (/data/project/dplbot/access.log), 
and this only shows requests that were handled correctly. In other words, there 
is no log indicating a request that came in at/just before a failure without a 
corresponding response going out.
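
If the rejected requests never reach access.log, the silences in that log 
should still bracket each failure window. Below is a rough sketch for listing 
them. It assumes the log uses lighttpd's default access log layout, with a 
bracketed timestamp such as [19/Mar/2019:14:02:11 +0000]; the five-minute 
threshold and the function names are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Bracketed timestamp as written in Common Log Format style access logs.
STAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def find_gaps(lines, min_gap=timedelta(minutes=5)):
    """Yield (last_before, first_after) datetimes around each long silence."""
    prev = None
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue  # skip lines without a recognisable timestamp
        now = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        if prev is not None and now - prev >= min_gap:
            yield prev, now
        prev = now


def report(path, min_gap=timedelta(minutes=5)):
    """Print every silence of at least `min_gap` found in the log at `path`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for before, after in find_gaps(handle, min_gap):
            print(f"no requests logged between {before} and {after}")
```

For example, `report("/data/project/dplbot/access.log")` would list the quiet 
periods. On a low-traffic tool some gaps will simply be idle time, so the 
output is only useful alongside known restart times.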

If these failures were being caused spontaneously by some problem in lighttpd 
or in the Toolforge infrastructure, I would expect other users to be affected 
by them, but that doesn't seem to be the case. 

This has previously been reported at https://phabricator.wikimedia.org/T115231 
(including more detail on the debug options I tried), where frankly I have 
received absolutely no assistance. I did receive one mildly helpful comment 
from bd808 on a related issue (https://phabricator.wikimedia.org/T218915), as 
follows:

> ... [It is] possible to have a Kubernetes powered webservice become 
> unresponsive to client requests due to an internal deadlock or resource 
> exhaustion issue in the application which does not also lead to a crash of 
> the lighttpd process itself.

However, if there is an internal deadlock or resource exhaustion issue in the 
underlying PHP scripts, I would expect some error message in the logs, which 
isn't there. Also, during a recent interval when the server was up for a while, 
I took the time to click every single link on 
https://tools.wmflabs.org/dplbot/, and the server responded to every one of 
them, so there does not seem to be a fatal bug in any of the scripts (although 
this exercise revealed a few minor issues).

I'm not necessarily looking for someone to solve this problem for me (although 
that would be nice :-) ), but just some ideas about how to identify potential 
causes. Right now it is basically a black hole; no information whatsoever is 
coming out of the webserver at the point of failure, so I can make no progress.

-- 
 Russell Blau
 russb...@imapmail.org
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud