On Thu, Feb 10, 2022 at 1:11 PM Aleix Pol <aleix...@kde.org> wrote:
>
> On Thu, Feb 10, 2022 at 11:05 AM Ben Cooksley <bcooks...@kde.org> wrote:
> >
> > On Thu, Feb 10, 2022 at 8:20 AM Aleix Pol <aleix...@kde.org> wrote:
> >>
> >> [Snip]
> >>
> >> We still haven't discussed here is how to prevent this problem from
> >> happening again.
> >>
> >> If we don't have information about what is happening, we cannot fix
> >> problems.
> >
> > Part of the issue here is that the problem only came to Sysadmin attention
> > very recently, when the system ran out of disk space as a result of growing
> > log files.
> > It was at that point we realised we had a serious problem.
> >
> > Prior to that the system load hadn't climbed to dangerous levels (> number
> > of CPU cores) and Apache was keeping up with the traffic, so none of our
> > other monitoring was tripped.
> >
> > If you have any thoughts on what sort of information you are thinking of
> > that would be helpful.
>
> We could have plots of the amount of queries we get with a KNewStuff/*
> user-agent over time and their distribution.
>
> > It would definitely be helpful though to know when new software is going to
> > be released that will be interacting with the servers as we will then be
> > able to monitor for abnormalities.
>
> We make big announcements of every Plasma release... (?)
>
> >> Is there anything that could be done in this front? The issue here
> >> could have been addressed months ago, we just never knew it was
> >> happening.
> >
> > One possibility that did occur to me today would be for us to integrate
> > some kind of killswitch that our applications would check on first
> > initialisation of functionality that talks to KDE.org servers.
> > This would allow us to disable the functionality in question on user
> > systems.
> >
> > The check would only be done on first initialization to keep load low,
> > while still ensuring all users eventually are affected by the killswitch
> > (as they will eventually need to logout/reboot for some reason or another).
> >
> > The killswitch would probably work best if it had some kind of version
> > check in it so we could specify which versions are disabled.
> > That would allow for subsequent updates - once delivered by distributions -
> > to restore the functionality (while leaving it disabled for those who
> > haven't updated).
>
> The file we are serving here effectively is the kill switch to all of
> KNewStuff.

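To make the version-gated killswitch idea quoted above a bit more concrete, here is a rough sketch of the check a client could run once on first initialisation. It is in Python purely for illustration; the JSON layout, the "disabled_below" field, the feature name and the version numbers are all made up for this example and are not anything KNewStuff or the KDE servers actually implement.

```python
import json


def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_disabled(killswitch_json, feature, client_version):
    """Return True if `feature` is switched off for `client_version`.

    Hypothetical document layout (an assumption for this sketch):
        {"some-feature": {"disabled_below": "1.2.1"}}
    meaning every client older than 1.2.1 must not use the feature.
    """
    rules = json.loads(killswitch_json)
    rule = rules.get(feature)
    if rule is None:
        # No entry for this feature: it stays enabled.
        return False
    return parse_version(client_version) < parse_version(rule["disabled_below"])


# In a real client this document would be fetched once per session;
# it is hard-coded here so the sketch is self-contained.
doc = '{"example-feature": {"disabled_below": "1.2.1"}}'

print(is_disabled(doc, "example-feature", "1.2.0"))   # older than the cutoff
print(is_disabled(doc, "example-feature", "1.2.1"))   # fixed release
print(is_disabled(doc, "example-feature", "1.10.0"))  # newer release
```

Comparing the versions as integer tuples rather than as strings is what lets a later release such as 1.10.0 count as newer than 1.2.1, so updated clients get the functionality back while older ones stay disabled.
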
I'm a bit late to the party, but for future reference: I think this was (and is) an architectural scaling problem on the server side as much as a bug in the client. If plain HTTPS load is the problem, then the "hotfix" is to put an HTTP load balancer in front of the server until fixes make it into the clients. Killing the clients should be the very last resort. I'm sure we have the money to afford a bunch of cloud nodes serving as selective proxy caches for a month to take the KNS load off the canonical server.

HS