Hi All,

Unfortunately, it looks like this switch decided to reboot again last night at around 1AM PDT. Thankfully, the impact was smaller than before because of the adjustments we made in recent weeks.

I wanted to send another update on how we're going to permanently fix this moving forward. I have racked two "new" Arista 1G switches that will replace two of the three Cisco Nexus fabric extenders where the majority of our hosts are. Once I have those plumbed into the new switches, I can start moving hosts over to these switches one by one. I'll send out another email with a list of the hosts this will impact in a few weeks once it's ready.

Before that happens, we need to finish running fiber to our second core switch and finish the MLAG configuration and backend upstream connection. Once this is finished, we'll have more redundancy in our network. There will be another brief outage when we switch over to the new "core" switches with MLAG.

Thanks again for your patience. Hopefully I can get these all done before this switch decides to reboot again!

On Mon, Sep 12, 2022 at 11:54 PM Lance Albertson <la...@osuosl.org> wrote:

> Sadly this just happened again about 50 minutes ago. We may need to do
> some emergency firmware patching tomorrow. As a backup plan, I'm also
> formulating a plan to add another switch to try and minimize the impact of
> this troublesome switch.
>
> Once I gather some additional information tomorrow morning, I'll send an
> update on what we're planning to do.
>
> Thanks again for your patience.
>
> On Mon, Sep 12, 2022 at 3:14 PM Lance Albertson <la...@osuosl.org> wrote:
>
>> This happened again at approximately 10AM PDT. Since we moved our uplink
>> to this switch, everything went down while the switch rebooted.
>>
>> We're still planning on doing an upgrade but don't have a date yet for
>> that. We'll hopefully get that going soon.
>>
>> Thanks for your patience.
>>
>> On Wed, Aug 24, 2022 at 7:40 AM Lance Albertson <la...@osuosl.org> wrote:
>>
>>> Unfortunately this just happened again overnight. We may need to
>>> schedule another outage to perform some software upgrade on this switch so
>>> that this stops happening. We'll send an announcement out once we have
>>> everything in place to do that upgrade.
>>>
>>> Thanks-
>>>
>>> On Wed, May 25, 2022 at 11:22 PM Lance Albertson <la...@osuosl.org>
>>> wrote:
>>>
>>>> All,
>>>>
>>>> It appears that one of our core network switches had a kernel panic and
>>>> rebooted which caused widespread outages throughout our infrastructure. As
>>>> of right now, everything appears to be back to normal but please let me
>>>> know if that isn't the case by sending an email to supp...@osuosl.org.
>>>>
>>>> Apologies for the outage and we'll be looking into why this switch had
>>>> a kernel panic in the first place.
>>>>
>>>> Thanks-
>>>>
>>>> --
>>>> Lance Albertson
>>>> Director
>>>> Oregon State University | Open Source Lab
>>>>
>>>
>>>
>>> --
>>> Lance Albertson
>>> Director
>>> Oregon State University | Open Source Lab
>>>
>>
>>
>> --
>> Lance Albertson
>> Director
>> Oregon State University | Open Source Lab
>>
>
>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab
>

--
Lance Albertson
Director
Oregon State University | Open Source Lab
_______________________________________________
Hosting mailing list
host...@osuosl.org
https://lists.osuosl.org/mailman/listinfo/hosting