I have the "new" switch set up and ready to go. I'm planning to do the cutover in about 20 minutes (3pm PDT). You will see a set of outages as I do the following:

1. Move LinkOregon uplink to "new" switch
2. Move oslsw3 uplink to "new" switch
3. Move oslsw1 uplink to "new" switch
4. Move remaining backend 10g switches

If anything goes wrong, I should be able to quickly revert the change.

On Tue, Sep 13, 2022 at 11:08 AM Lance Albertson <la...@osuosl.org> wrote:

> All,
>
> I wanted to pass along more information on where we're at and our current
> plans to try and work around this issue.
>
> Without going deep into the history of our core network infrastructure, we
> have two core "routers" that are both aging and we're in the process of
> replacing them with something newer.
>
> Previously, our uplink was connected through our Cisco 6509. This switch
> has several 1G line cards that half of our servers are directly connected
> to.
>
> The other core switch is a Cisco Nexus 6001 which has three fabric
> extenders which provide 1G connectivity to the other half of our servers.
> When we migrated over to the LinkOregon network, we moved the uplink over
> to this Nexus 6k as it was much easier to get LR optics for it.
>
> Unfortunately this Nexus 6k has started kernel panicking and rebooting in
> the past several months multiple times causing these outages. Much of our
> downlink 10G switches are connected to this Nexus 6k which means there's a
> larger impact when it goes down.
>
> A few years ago a high speed trading company donated us a pallet full of
> Arista switches and I've been slowly adding to our infrastructure. Even
> though they are EOL, they still work very well and we haven't had any
> problems with them. And since I have a lot of them, I can easily replace
> one if one goes bad.
>
> My current plan is to set up one of these Arista switches and move all of
> the current 10G connections to it. This way, at least we can reduce the
> impact if/when this Nexus 6k switch reboots again. In theory, it should
> only affect the servers directly connected to the FEX switches if it
> reboots again.
>
> I reached out to the OSU IT community and they graciously donated two
> 10G-LR optical modules so that I can put this plan in place without having
> to wait to ship modules.
>
> Current plan for today:
> - Setup new Arista switch
> - Move upstream connectivity to LinkOregon to it
> - Move all downstream 10G links to this router
>
> I will send another email when I plan to do the actual outages for the cut
> over.
>
> Longer term plans
> - Work with vendors to replace our aging core network infrastructure with
>   something that's still supported and we can afford
> - Look into getting redundancy put into place so that we don't have this
>   issue anymore
> - Migrate off of the older equipment
>
> If anyone on this list has connections to Arista or any other major edge
> networking vendor, please let me know. That will certainly help our
> situation in the long term!
>
> I had already started working on a plan to replace these systems but it
> seems my time may have run out (at least for the Nexus 6k switch).
>
> Thanks all for your patience and support!
>
> On Mon, Sep 12, 2022 at 11:54 PM Lance Albertson <la...@osuosl.org> wrote:
>
>> Sadly this just happened again about 50 minutes ago. We may need to do
>> some emergency firmware patching tomorrow. As a backup plan, I'm also
>> formulating a plan to add another switch to try and minimize the impact of
>> this troublesome switch.
>>
>> Once I gather some additional information tomorrow morning, I'll send an
>> update on what we're planning to do.
>>
>> Thanks again for your patience.
>>
>> On Mon, Sep 12, 2022 at 3:14 PM Lance Albertson <la...@osuosl.org> wrote:
>>
>>> This happened again at approximately 10AM PDT. Since we moved our uplink
>>> to this switch, everything went down while the switch rebooted.
>>>
>>> We're still planning on doing an upgrade but don't have a date yet for
>>> that. We'll hopefully get that going soon.
>>>
>>> Thanks for your patience.
>>>
>>> On Wed, Aug 24, 2022 at 7:40 AM Lance Albertson <la...@osuosl.org>
>>> wrote:
>>>
>>>> Unfortunately this just happened again overnight. We may need to
>>>> schedule another outage to perform some software upgrade on this switch so
>>>> that this stops happening. We'll send an announcement out once we have
>>>> everything in place to do that upgrade.
>>>>
>>>> Thanks-
>>>>
>>>> On Wed, May 25, 2022 at 11:22 PM Lance Albertson <la...@osuosl.org>
>>>> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> It appears that one of our core network switches had a kernel panic
>>>>> and rebooted which caused widespread outages throughout our
>>>>> infrastructure.
>>>>> As of right now, everything appears to be back to normal but please let me
>>>>> know if that isn't the case by sending an email to supp...@osuosl.org.
>>>>>
>>>>> Apologies for the outage and we'll be looking into why this switch had
>>>>> a kernel panic in the first place.
>>>>>
>>>>> Thanks-
>>>>>
>>>>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab
>

--
Lance Albertson
Director
Oregon State University | Open Source Lab
_______________________________________________
Hosting mailing list
host...@osuosl.org
https://lists.osuosl.org/mailman/listinfo/hosting