On Thu, Feb 10, 2022 at 1:11 PM Aleix Pol <aleix...@kde.org> wrote:
>
> On Thu, Feb 10, 2022 at 11:05 AM Ben Cooksley <bcooks...@kde.org> wrote:
> >
> >
> >
> > On Thu, Feb 10, 2022 at 8:20 AM Aleix Pol <aleix...@kde.org> wrote:
> >>
> >> [Snip]
> >>
> >> What we still haven't discussed here is how to prevent this problem
> >> from happening again.
> >>
> >> If we don't have information about what is happening, we cannot fix 
> >> problems.
> >
> >
> > Part of the issue here is that the problem only came to Sysadmin attention 
> > very recently, when the system ran out of disk space as a result of growing 
> > log files.
> > It was at that point we realised we had a serious problem.
> >
> > Prior to that the system load hadn't climbed to dangerous levels (> number 
> > of CPU cores) and Apache was keeping up with the traffic, so none of our 
> > other monitoring was tripped.
> >
> > If you could say what sort of information you are thinking of, that
> > would be helpful.
>
> We could have plots of the amount of queries we get with a KNewStuff/*
> user-agent over time and their distribution.
>
> > It would definitely be helpful though to know when new software is going to 
> > be released that will be interacting with the servers as we will then be 
> > able to monitor for abnormalities.
>
> We make big announcements of every Plasma release... (?)
>
> >> Is there anything that could be done on this front? The issue here
> >> could have been addressed months ago, we just never knew it was
> >> happening.
> >
> >
> > One possibility that did occur to me today would be for us to integrate 
> > some kind of killswitch that our applications would check on first 
> > initialisation of functionality that talks to KDE.org servers.
> > This would allow us to disable the functionality in question on user 
> > systems.
> >
> > The check would only be done on first initialization to keep load low, 
> > while still ensuring all users eventually are affected by the killswitch 
> > (as they will eventually need to logout/reboot for some reason or another).
> >
> > The killswitch would probably work best if it had some kind of version 
> > check in it so we could specify which versions are disabled.
> > That would allow for subsequent updates - once delivered by distributions - 
> > to restore the functionality (while leaving it disabled for those who 
> > haven't updated).
>
> The file we are serving here effectively is the kill switch to all of 
> KNewStuff.

I'm a bit late to the party, but for future reference: I think this
was (and is) an architectural scaling problem on the server side as
much as a bug in the client. If HTTPS load alone is the problem, then
the "hotfix" is to put an HTTP load balancer in front of the server
until fixes make it into the clients. Killing the clients should be
the very last resort. I'm sure we have the money to afford a bunch of
cloud nodes serving as selective proxy caches for a month to balance
out the KNS load on the canonical server.
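
To make "selective proxy cache" a bit more concrete, here is a rough
Python sketch of the idea, not a description of any real KDE setup.
It caches responses only for requests whose user-agent starts with
KNewStuff/ (the prefix mentioned above) and passes everything else
straight through. The class name, the fetch callable and the 300
second TTL are all made up for illustration. In practice this job
would be done by an off-the-shelf caching proxy rather than custom
code.

```python
import time
from typing import Callable, Dict, Tuple

KNS_PREFIX = "KNewStuff/"  # user-agent prefix mentioned in this thread


class SelectiveCache:
    """Cache origin responses, but only for KNewStuff clients (sketch)."""

    def __init__(self, fetch: Callable[[str], bytes], ttl: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # hypothetical callable that queries the origin
        self._ttl = ttl        # assumed freshness window in seconds
        self._clock = clock    # injectable so the behaviour can be tested
        self._store: Dict[str, Tuple[float, bytes]] = {}

    def get(self, path: str, user_agent: str) -> bytes:
        # Non-KNS traffic is not cached and always reaches the origin.
        if not user_agent.startswith(KNS_PREFIX):
            return self._fetch(path)
        now = self._clock()
        hit = self._store.get(path)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]      # still fresh: the origin is not contacted
        body = self._fetch(path)
        self._store[path] = (now, body)
        return body
```

With something like this in front of the canonical server, repeated
KNS requests for the same path within the TTL cost the origin one
request instead of one per client.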

HS
