Should have sent this earlier, but here we go
https://wikitech.wikimedia.org/wiki/Incidents/2024-10-09_calico-codfw

On Wed, Oct 9, 2024 at 4:20 PM Brooke Vibber <[email protected]> wrote:

> Yikes! And now... We know (knowing is half the battle)
>
> -- brooke
>
> On Wed, Oct 9, 2024, 12:12 AM Alexandros Kosiaris <[email protected]>
> wrote:
>
>> Hi,
>>
>> Well, there was one surprise, and it did explode! It was probably very
>> close to a full outage in codfw. We are still in the process of documenting
>> everything and we'll publish a full incident report, but the TL;DR is that
>> every single mwscript-k8s invocation happening in the for loop creates an
>> entire helm release, including all of its k8s resources. This is by design,
>> but it had an unforeseen consequence: close to 2k Calico NetworkPolicies
>> were created (along with a ton of other resources, which would create their
>> own set of problems). Every Calico component in k8s had to gradually react
>> to the increasing number of those policies, which ended up hitting resource
>> limits for some of the components, which led to throttling, then to
>> failures, and finally into a slowly cascading outage that was putting
>> hardware nodes out of rotation one by one. The last couple of hours were
>> interesting for some of us, I can tell you that.
>>
>> We are already working on plans to fix this (we've got enough action items
>> already, including amending the design), but in the meantime, until we send
>> an update saying we've solved it, please don't spawn mwscript-k8s in a for
>> loop or anything similar. You can of course continue working with the tool
>> to get acquainted with it, find bugs, etc. -- just please don't spawn
>> hundreds or, even worse, thousands of invocations.
>>
>> Brooke, I had to kill your bash shell on deploy2002 that was doing the
>> transcodes. I'm sorry about that, but despite attaching to your screen I
>> didn't manage to find out how to stop it (it didn't respond to any of the
>> usual control sequences or shell job controls), and I didn't want to risk
>> one more outage (which would probably have happened once the resources
>> reached some critical number).
>>
>> On Wed, Oct 9, 2024 at 6:16 AM Reuven Lazarus <[email protected]>
>> wrote:
>>
>>> Great to hear, thanks!
>>>
>>> As a side note for others, to highlight something Brooke said in
>>> passing: Wrapping mwscript-k8s in a bash for loop is a fine idea, *as
>>> long as* you're running it with --follow or --attach. In that case,
>>> each mwscript-k8s invocation will keep running to monitor the job's output,
>>> and will terminate when the job terminates. One job will run at a time,
>>> which is what you expect.
>>>
>>> Without --follow or --attach, mwscript-k8s is just the launcher: it
>>> kicks off your job, then terminates immediately. Your for loop will
>>> rapid-fire *launch* all the jobs one after another, which means hundreds
>>> of 'em might be executing simultaneously, and that might not be what you
>>> had in mind. If your job involves expensive DB operations, it *really*
>>> might not be what you had in mind.
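>>>
>>> To make the distinction concrete, here's a hedged sketch. The script
>>> name and wiki list are made up, and a stand-in shell function replaces
>>> mwscript-k8s (which only exists on the deploy hosts), so the loop just
>>> prints what it would launch:

```shell
#!/bin/bash
# Stand-in for mwscript-k8s, which only exists on WMF deploy hosts;
# it prints what would be launched instead of launching anything.
mwscript_k8s() { echo "launch: $*"; }

# Hypothetical list of wiki DB names (in practice, read from a dblist).
wikis="testwiki test2wiki commonswiki"

# Sequential pattern: with --follow, each real invocation blocks until
# its job finishes, so only one job runs at a time.
for wiki in $wikis; do
  mwscript_k8s --follow -- someScript.php --wiki="$wiki"
done

# Without --follow, each invocation would return as soon as its job was
# *launched*, so the same loop would fire off every job at once.
```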
>>>
>>> First-class dblist support will indeed make that pitfall easier to
>>> avoid. In the meantime there's nothing wrong with using a for loop, and
>>> it's what I'd do too -- but since this is a new system and nobody has
>>> well-honed intuition for it yet, I wanted to draw everyone's eye to that
>>> distinction.
>>>
>>> On Tue, Oct 8, 2024 at 4:37 PM Brooke Vibber <[email protected]>
>>> wrote:
>>>
>>>> I'm starting some batch maintenance of video transcodes so I'm
>>>> exercising the new k8s-based maint script system on TMH's
>>>> requeueTranscodes.php; good news: no surprises so far, everything's working
>>>> just fine. :D
>>>>
>>>> Since I'm running the same scripts over multiple wikis, I went ahead and
>>>> manually wrapped them in a bash for loop so it submits one job at a time
>>>> out of all.dblist, using a screen session for the wrapper loop and
>>>> tailing the logs into the session so they don't all smash out at once,
>>>> plus a second manually-started run for Commons. :)
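>>>>
>>>> Roughly, the wrapper looks like the sketch below. The dblist path is
>>>> from memory and may be wrong, and echo stands in for the real
>>>> mwscript-k8s --follow call, so this only prints what it would run:

```shell
#!/bin/bash
# Wrapper loop (run inside a screen session so it survives disconnects):
# one job at a time over all.dblist, skipping Commons, which gets its
# own separately started run.
dblist=/srv/mediawiki/dblists/all.dblist   # hypothetical path

# Fall back to a small sample list so this sketch is self-contained.
if [ -r "$dblist" ]; then
  wikis=$(grep -v '^#' "$dblist")
else
  wikis="testwiki commonswiki test2wiki"
fi

for wiki in $wikis; do
  [ "$wiki" = "commonswiki" ] && continue    # Commons runs separately
  # echo stands in for the real call: mwscript-k8s --follow -- ...
  echo mwscript-k8s --follow -- requeueTranscodes.php --wiki="$wiki"
done
```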
>>>>
>>>> First-class support for running over a dblist will be a very welcome
>>>> improvement, and should be pretty straightforward! Good work everybody. :D
>>>>
>>>> The longest job (Commons) might take a couple days to run, so we'll see
>>>> if anything explodes later! hehe
>>>>
>>>> -- brooke
>>>>
>>>> On Wed, Sep 25, 2024 at 8:11 PM Reuven Lazarus <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> With MediaWiki at the WMF moving to Kubernetes, it's now time to start
>>>>> running manual maintenance scripts there. Any time you would previously 
>>>>> SSH
>>>>> to a mwmaint host and run mwscript, follow these steps instead. The old 
>>>>> way
>>>>> will continue working for a little while, but it will be going away.
>>>>>
>>>>>
>>>>> What's familiar:
>>>>>
>>>>> Starting a maintenance script looks like this:
>>>>>
>>>>>   rzl@deploy2002:~$ mwscript-k8s --comment="T341553" -- Version.php
>>>>> --wiki=enwiki
>>>>>
>>>>> Any options for the mwscript-k8s tool, as described below, go before
>>>>> the --.
>>>>>
>>>>> After the --, the first argument is the script name; everything else
>>>>> is passed to the script. This is the same as you're used to passing to
>>>>> mwscript.
>>>>>
>>>>>
>>>>> What's different:
>>>>>
>>>>> - Run mwscript-k8s on a deployment host, not the maintenance host.
>>>>> Either deployment host will work; your job will automatically run in
>>>>> whichever data center is active, so you no longer need to change hosts 
>>>>> when
>>>>> there’s a switchover.
>>>>>
>>>>> - You don't need a tmux. By default the tool launches your
>>>>> maintenance script and exits immediately, without waiting for your job to
>>>>> finish. If you log out of the deployment host, your job keeps running on
>>>>> the Kubernetes cluster.
>>>>>
>>>>> - Kubernetes saves the maintenance script's output for seven days
>>>>> after completion. By default, mwscript-k8s prints a kubectl command that
>>>>> you (or anyone else) can paste and run to monitor the output or save it to
>>>>> a file.
>>>>>
>>>>> - As a convenience, you can pass -f (--follow) to mwscript-k8s to 
>>>>> immediately
>>>>> begin tailing the script output. If you like, you can do this inside
>>>>> a tmux and keep the same workflow as before. Either way, you can safely
>>>>> disconnect and your script will continue running on Kubernetes.
>>>>>
>>>>>   rzl@deploy2002:~$ mwscript-k8s -f -- Version.php --wiki=testwiki
>>>>>
>>>>>   [...]
>>>>>
>>>>>   MediaWiki version: 1.43.0-wmf.24 LTS (built: 22:35, 23 September
>>>>> 2024)
>>>>>
>>>>> - For scripts that take input on stdin, you can pass --attach to
>>>>> mwscript-k8s, either interactively or in a pipeline.
>>>>>
>>>>>   rzl@deploy2002:~$ mwscript-k8s --attach -- shell.php --wiki=testwiki
>>>>>
>>>>>   [...]
>>>>>
>>>>>   Psy Shell v0.12.3 (PHP 7.4.33 — cli) by Justin Hileman
>>>>>
>>>>>   > $wmgRealm
>>>>>
>>>>>   = "production"
>>>>>
>>>>>   >
>>>>>
>>>>>   rzl@deploy2002:~$ cat example_url.txt | mwscript-k8s --attach --
>>>>> purgeList.php
>>>>>
>>>>>   [...]
>>>>>
>>>>>   Purging 1 urls
>>>>>
>>>>>   Done!
>>>>>
>>>>> - Your maintenance script runs in a Docker container which will not
>>>>> outlive it, so it can't save persistent files to disk. Ensure your
>>>>> script logs its important output to stdout, or persists it in a database 
>>>>> or
>>>>> other remote storage.
>>>>>
>>>>> - The --comment flag sets an optional (but encouraged) descriptive
>>>>> label, such as a task number.
>>>>>
>>>>> - Using standard kubectl commands[1][2], you can check the status, and
>>>>> view the output, of your running jobs or anyone else's. (Example: 
>>>>> `kube_env
>>>>> mw-script codfw; kubectl get pod -l username=rzl`)
>>>>>
>>>>> [1]: https://wikitech.wikimedia.org/wiki/Kubernetes/Kubectl
>>>>>
>>>>> [2]: https://kubernetes.io/docs/reference/kubectl/quick-reference/
>>>>>
>>>>>
>>>>> What's not supported yet:
>>>>>
>>>>> - Maintenance scripts launched automatically on a timer. We're working
>>>>> on migrating them -- for now, this is for one-off scripts launched by 
>>>>> hand.
>>>>>
>>>>> - If your job is interrupted (e.g. by hardware problems), Kubernetes
>>>>> can automatically move it to another machine and restart it, babysitting 
>>>>> it
>>>>> until it completes. But we only want to do that if your job is safe to
>>>>> restart. So by default, if your job is interrupted, it will stay stopped
>>>>> until you restart it yourself. Soon, we'll add an option to declare
>>>>> "this is idempotent, please restart it as needed"; writing new scripts
>>>>> to be idempotent, so they can use that option, is recommended.
>>>>>
>>>>> - No support yet for mwscriptwikiset, foreachwiki,
>>>>> foreachwikiindblist, etc., but we'll add similar functionality as flags
>>>>> to mwscript-k8s.
>>>>>
>>>>>
>>>>> Your feedback:
>>>>>
>>>>> Let me know by email or IRC, or on Phab (T341553
>>>>> <https://phabricator.wikimedia.org/T341553>). If mwscript-k8s doesn't
>>>>> work for you, for now you can fall back to using the mwmaint hosts as
>>>>> before -- but they will be going away. Please report any problems sooner
>>>>> rather than later, so that we can ensure the new system meets your needs
>>>>> before that happens.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Reuven, for Service Ops SRE
>>>>> _______________________________________________
>>>>> Wikitech-l mailing list -- [email protected]
>>>>> To unsubscribe send an email to [email protected]
>>>>>
>>>>> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
>>>>
>>>
>>>
>>>
>>> --
>>> *Reuven Lazarus *(he/him)
>>> Staff Site Reliability Engineer
>>> Wikimedia Foundation <https://wikimediafoundation.org/>
>>
>>
>>
>> --
>> Alexandros Kosiaris
>> Principal Site Reliability Engineer
>> Wikimedia Foundation
>



-- 
Alexandros Kosiaris
Principal Site Reliability Engineer
Wikimedia Foundation
