Re: Data Center testing
On Mon, Aug 24, 2009 at 09:38:38AM -0400, Dan Snyder wrote:
> We have done power tests before and had no problem. I guess I am looking
> for someone who does testing of the network equipment outside of just power
> tests. We had an outage due to a configuration mistake that became apparent
> when a switch failed. It didn't cause a problem however when we did a power
> test for the whole data center.

Dan,

With all due respect, if there are config changes being made to your devices that aren't authorized or in accordance with your standards (you *do* have config standards, right?) then you don't have a testing problem, you have a data integrity problem. Periodically inducing failures to catch them is sorta like using your smoke detector as an oven timer.

There are several tools that can help in this area; a good free one is rancid [1], which logs in to your routers and collects copies of configs and other info, all of which gets stored in a central repository. By default, you will be notified via email of any changes.

An even better approach than scanning the hourly config diff emails is to develop scripts that compare the *actual* state of the network with the *desired* state and alert you if the two are not in sync. Obviously this is more work because you have to have some way of describing the desired state of the network in machine-parsable format, but the benefit is that you know in pseudo-realtime when something is wrong, as opposed to finding out the next time a device fails. Rancid diffs + tacacs logs will tell you who made the changes, and with that info you can get at the root of the problem.

Having said that, every planned maintenance activity is an opportunity to run through at least some failure cases. If one of your providers is going to take down a longhaul circuit, you can observe how traffic re-routes and verify that your metrics and/or TE are doing what you expect.
Any time you need to load new code on a device you can test that things fail over appropriately. Of course, you have to be willing to just shut the device down without draining it first, but that's between you and your customers. Link and/or device failures will generate routing events that could be used to test convergence times across your network, etc.

The key is to be prepared. The more instrumentation you have in place prior to the test, the better you will be able to analyze the impact of the failure. An experienced operator can often tell right away when looking at a bunch of MRTG graphs that "something doesn't look right", but that doesn't tell you *what* is wrong. There are tools (free and commercial) that can help here, too. Have a central syslog server and some kind of log reduction tool in place. Have beacons/probes deployed, in both the control and data planes. If you want to record, analyze, and even replay routing system events, you might want to take a look at the Route Explorer product from Packet Design [2].

You said "switch failure" above, so I'm guessing that this doesn't apply to you, but there are also good network simulation packages out there. Cariden [3] and WANDL [4] can build models of your network based on actual router configs and let you simulate the impact of various scenarios, including device/link failures. However, these tools are more appropriate for design and planning than for catching configuration mistakes, so they may not be what you're looking for in this case.

--Jeff

[1] http://www.shrubbery.net/rancid/
[2] http://www.packetdesign.com/products/rex.htm
[3] http://www.cariden.com/
[4] http://www.wandl.com/html/index.php
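As a rough illustration of the actual-vs-desired comparison described above, here is a minimal sketch of a drift checker. The file contents, the IOS-style "!" comment convention, and the normalisation rules are illustrative assumptions; rancid itself only collects the configs, and how you express "desired state" is up to you.

```python
import difflib

def normalize(text):
    """Strip comment lines and blank lines so the diff only shows
    meaningful configuration drift, not cosmetic noise."""
    lines = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("!"):  # IOS-style comment/banner lines
            continue
        lines.append(line)
    return lines

def drift(desired_text, actual_text):
    """Return a unified diff between desired and actual config.
    An empty list means the device matches the desired state."""
    return list(difflib.unified_diff(
        normalize(desired_text), normalize(actual_text),
        fromfile="desired", tofile="actual", lineterm=""))

# Tiny inline demo; in practice 'desired' would come from your template
# repository and 'actual' from the rancid-collected copy of the config.
desired = "!\nhostname core1\nntp server 192.0.2.1\n"
actual = "!\nhostname core1\nntp server 192.0.2.99\n"
delta = drift(desired, actual)
for line in delta:
    print(line)
```

Run from cron, a nonzero exit when `delta` is non-empty turns this into the "pseudo-realtime" alert described above, rather than waiting for the hourly diff email.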
SORBS?
I need a SORBS maintainer to contact me. The SORBS site reports the site and databases are in maintenance mode for the second day in a row. One of my domains was legitimately listed, but now that I've resolved the problem, I'm unable to request removal.

Regards,
Tim R. Rainier
Systems Administrator II, Kalsec Inc.
www.kalsec.com
Re: SORBS?
On Tue, 25 Aug 2009 train...@kalsec.com wrote:
> I need a SORBS maintainer to contact me. The SORBS site reports the site
> and databases are in maintenance mode for the second day in a row. One of
> my domains was legitimately listed, but now that I've resolved the
> problem, I'm unable to request removal.

Based on info previously posted to the SORBS web site, I suspect SORBS may be in the middle of relocating their servers (changing hosting providers). If that's the case, I don't think you're going to have any luck getting changes made to the SORBS database until the move has been completed.

--
 Jon Lewis              | I route
 Senior Network Engineer | therefore you are
 Atlantic Net           | http://www.lewis.org/~jlewis/pgp for PGP public key
Re: SORBS?
On Aug 25, 2009, at 8:40 AM, train...@kalsec.com wrote:
> I need a SORBS maintainer to contact me.

I don't think they watch here; at least I've never seen Michelle post here. Try dnsbl-users, the SORBS mailing list. From the google cache of the Mailing Lists page --

"This list is an open list where the SORBS DNSbl may be discussed. If it is about the SORBS DNSbl it is on topic (including questions on how to configure mailers to use SORBS). Currently this list is quiet, un-moderated, and anyone is free to join. Non-members of the list are not permitted to send mail to the list.

For people who don't know the meaning of "confirmed opt-in" ("double opt-in" as most spammers call it), subscribe to this list and you will see how it works.

Subscription is performed by sending a message to: majord...@dnsbl.sorbs.net
with a message body of:
subscribe dnsbl-users
end
"
Re: SORBS?
Thanks for the replies. I will use the mailing list if my issue doesn't get resolved.

Regards,
Tim R. Rainier
Systems Administrator II, Kalsec Inc.
www.kalsec.com

Marc Powell wrote on 08/25/2009 10:35:43 AM:
> [quoted message snipped -- see above]
Re: SORBS?
On Tue, 2009-08-25 at 09:35 -0500, Marc Powell wrote:
> I don't think they watch here; at least I've never seen Michelle post
> here.

I've had confirmation from Michelle personally this morning (following a similar question elsewhere) that the SORBS systems are indeed relocating. From a previous message to SPAM-L (reproduced with permission):

Michelle Sullivan wrote:
> SORBS is not closing. SORBS has received 3 credible offers for the
> purchase of SORBS, one of which was not interested in continuing SORBS
> but obtaining the IP and spamtraps. SORBS will not be accepting the
> latter offer.
>
> Currently the two offers being considered are with anti-spam vendors,
> and one of the two has indicated that they will not commercialise
> SORBS, but keep it as a community project. The other anti-spam vendor
> has indicated they would pursue a split commercial model, where there
> would be a free service as well as a 'premium' service (how this would
> work I do not know).
>
> An announcement about which company is successful will be forthcoming
> when necessary paperwork has been signed.
>
> Small outages will occur in the central database when the servers are
> moved; this will NOT affect SORBS services globally, only updates
> (listing and delisting) and local (Au) services during the outages.

As inconvenient as this outage may be, the background to it is one from which a large proportion of this list probably bears scars - physical relocation.

On a related note, no, I don't have any information as to who it is that has taken SORBS on.

Regards,
Graeme
Re: Alternatives to storm-control on Cat 6509.
On Mon, Aug 24, 2009 at 4:59 PM, Nick Hilliard wrote:
> On 24/08/2009 19:03, Holmes,David A wrote:
>> Additionally, and perhaps most significantly for deterministic network
>> design, the copper cards share input hardware buffers for every 8 ports.
>> Running one port of the 8 at wire speed will cause input drops on the
>> other 7 ports. Also, the cards connect to the older 32 Gbps shared bus.
>
> IMO, a more serious problem with the 6148tx and 6548tx cards is the
> internal architecture, which is effectively six internal managed gigabit
> ethernet hubs (i.e. shared bus) with a 1M buffer per hub, and each hub
> connected with a single 1G uplink to a 32 gig backplane. Ref:
>
> http://www.cisco.com/en/US/products/hw/switches/ps700/products_tech_note09186a00801751d7.shtml#ASIC
>
> In Cisco's own words: "These line cards are oversubscription cards that are
> designed to extend gigabit to the desktop and might not be ideal for server
> farm connectivity". In other words, these cards are fine in their place,
> but they are not designed or suitable for data centre usage.
>
> I don't want to sound like I'm damning this card beyond redemption - it has
> a useful place in this world - but at the expense of reliability,
> manageability and configuration control, you will get useful features
> (including broadcast/unicast flood control) and in many situations very
> significantly better performance from a recent SRW 48-port linksys gig
> switch than from one of these cards.
>
> Nick

We experienced the joy of using the X6148 cards with a SAN/ESX cluster. Lots of performance issues! A fairly inexpensive solution was to switch to the X6148A card instead, which does not suffer from the 8:1 oversubscription. It also supports MTUs larger than 1500, which was another shortcoming of the older card.

Mike
--
Mike Bartz
m...@bartzfamily.net
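The 8:1 figure follows directly from the architecture Nick describes (six groups of eight gigabit ports, each group behind a single 1G connection to the bus). A quick back-of-the-envelope check:

```python
# Oversubscription arithmetic for a WS-X6148-style card, using the
# numbers from the architecture described above: 48 GE ports arranged
# as six 8-port groups, each group sharing one 1 Gb/s path to the bus.
ports_per_group = 8
port_speed_gbps = 1
uplink_gbps_per_group = 1

oversub = (ports_per_group * port_speed_gbps) / uplink_gbps_per_group
print(f"per-group oversubscription: {oversub:.0f}:1")  # 8:1

groups = 6
card_offered_gbps = groups * ports_per_group * port_speed_gbps
card_uplink_gbps = groups * uplink_gbps_per_group
print(f"card total: {card_offered_gbps} Gb/s of ports "
      f"into {card_uplink_gbps} Gb/s of internal uplink")
```

Which is why one port running at wire speed can starve the other seven in its group: all eight contend for the same 1 Gb/s internal path.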
Re: FCCs RFC for the Definition of Broadband
It's not a technical question, it's a political one, so feel free to squelch this for off-topicness if you want.

Technically, broadband is "faster than narrowband", and beyond that it's "fast enough for what you're trying to sell"; tell me what you're trying to sell and I'll tell you how fast a connection you need.

If you're trying to sell email, VOIP, and lightly-graphical web browsing, 64kbps is enough, and 128 is better. If you're trying to sell wireless data excluding laptop tethering, that's also fast enough for anything except maybe uploading hi-res camera video. If you're trying to sell talking-heads video conferencing, 128's enough but 384's better. If you're trying to sell internet radio, somewhere around 300 is probably enough. If you're trying to sell online gaming, you'll need to find a WoW addict; I gather latency's a bit more of an issue than bandwidth for most people. If you're trying to sell home web servers - oh wait, they're not! - 100-300k's usually enough, unless you get slashdotted, in which case you need 50-100Mbps for a couple of hours. If you're trying to sell Youtube-quality video, 1 Mbps is enough, 3 Mbps is better.

If you're trying to sell television replacement, 10M's about enough for one HD channel, 20's better, but the real question is what kind of multicast upstream infrastructure you're using to manage the number of channels you're selling, and whether you're price-competitive with cable, satellite, or radio broadcast, and how well you get along with your city and state regulators who'd like a piece of the action.

If what you're trying to sell is "the relevance of the FCC to the Democratic political machines", the answer is measured in TV-hours, newspaper-inches, and letters to Congresscritters, which isn't my problem.
Re: FCCs RFC for the Definition of Broadband
On Aug 24, 2009, at 9:17 AM, Luke Marrott wrote:
> What are your thoughts on what the definition of Broadband should be
> going forward? I would assume this will be the standard definition for a
> number of years to come.

Historically, narrowband was circuit switched (ISDN etc) and broadband was packet switched. Narrowband was therefore tied to the digital signaling hierarchy and was in some way a multiple of 64 kbps. As the term was used then, broadband delivery options of course included virtual circuits bearing packets, like Frame Relay and ATM.

The new services I am hearing about include streamed video to multiple HD TVs in the home. I think I would encourage the FCC, in discussing "broadband", to step away from the technology and look at the bandwidth usably delivered (as in "I don't care what the bit rate of the connection at the curb is if the back end is clogged; how much can a commodity TCP session move through the network?").

http://tinyurl.com/pgxqzb suggests that the average broadband service worldwide delivers a download rate of 1.5 Mbps; having the FCC assert that the new definition of broadband is that it delivers a usable data rate in excess of 1 Mbps while narrowband delivers less seems reasonable. That said, the US is ~15th worldwide in broadband speed; Belgium, Ireland, South Korea, Taiwan, and the UK seem to think that FTTH that can serve multiple HDTVs simultaneously is normal.
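The "usable data rate" measurement above amounts to timing a single commodity TCP transfer. A minimal sketch, with the test URL as a placeholder (you would substitute a large file hosted near the path you want to measure):

```python
import time
import urllib.request

def rate_mbps(num_bytes, elapsed_seconds):
    """Convert a byte count over an interval into megabits per second."""
    return (num_bytes * 8) / (elapsed_seconds * 1_000_000)

def measure_mbps(url, max_bytes=5_000_000):
    """Time a single TCP download of up to max_bytes and return the
    delivered rate in Mb/s -- the 'usable' figure, which includes the
    effect of any congestion on the back end, not the curb bit rate."""
    start = time.monotonic()
    received = 0
    with urllib.request.urlopen(url, timeout=30) as resp:
        while received < max_bytes:
            chunk = resp.read(64 * 1024)
            if not chunk:
                break
            received += len(chunk)
    return rate_mbps(received, time.monotonic() - start)

# Example (placeholder URL -- point at a real large file):
# print(f"{measure_mbps('http://example.com/largefile.bin'):.2f} Mb/s")
```

By the 1 Mbps threshold suggested above, a connection would qualify as broadband only if a transfer like this sustains more than 1 Mb/s end to end.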
RE: Data Center testing
There's more to data integrity in a data center (well, anything powered, that is) than network configurations. There's the loading of individual power outlets, UPS loading, UPS battery replacement cycles, loading of circuits, backup lighting, etc. And the only way to know if something is really working like it's designed is to test it. That's why we have financial auditors, military exercises, fire drills, etc. So while your analogy emphasizes the importance of having good processes in place to catch the problems up front, it doesn't eliminate throwing the switch.

Frank

-----Original Message-----
From: Jeff Aitken [mailto:jait...@aitken.com]
Sent: Tuesday, August 25, 2009 7:53 AM
To: Dan Snyder
Cc: NANOG list
Subject: Re: Data Center testing

> [Jeff's message quoted in full; snipped -- see earlier in the thread]
Re: Data Center testing
On Tue, Aug 25, 2009 at 7:53 AM, Jeff Aitken wrote:
> [..] Periodically inducing failures to catch them is sorta like using
> your smoke detector as an oven timer.
> [..]
> machine-parsable format, but the benefit is that you know in pseudo-realtime
> when something is wrong, as opposed to finding out the next time a device
> fails.

Config checking can't say much about silent hardware failures. Unanticipated problems are likely to arise in failover systems, especially complicated ones, and a failover system that has not been periodically verified may not work as designed. Simulations, config review, and change controls are not substitutes for testing; they address overlapping but different problems. Testing detects unanticipated error; config review is a preventive measure that helps avoid and correct apparent configuration issues. Config checking (of both software and hardware choices) also helps to keep out unnecessary complexity.

A human still has to write the script and review its output -- an operator error could eventually occur that is an accidental omission from both the current state and the "desired" state, so there is a chance that an erroneous entry escapes detection.

There can be other types of errors: possibly there is a damaged patch cable, a dying port, a failing power supply, or other hardware on the warm spare that has silently degraded, whose poor condition won't be detected until it actually tries to take a heavy workload, blows a fuse, eats a transceiver, and everything just falls apart. Perhaps you upgraded a hardware module or software image X months ago to fix bug Y on the secondary unit, and the upgrade caused completely unanticipated side effect Z.

--
-Mysid
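One way to catch the silently degrading hardware described above without inducing a failure is to trend interface error counters between polls. The snapshot format below (interface name mapped to counter values) is an assumption; in practice the counters would come from SNMP (e.g. ifInErrors) or parsed "show interface" output.

```python
def degraded(prev, curr, counters=("in_errors", "crc", "out_discards")):
    """Compare two polling snapshots of per-interface error counters
    and return (interface, counter, delta) for any counter that grew.
    A climbing CRC count on an idle warm spare is exactly the kind of
    problem config checking can never see."""
    alerts = []
    for ifname, now in curr.items():
        before = prev.get(ifname, {})
        for c in counters:
            delta = now.get(c, 0) - before.get(c, 0)
            if delta > 0:
                alerts.append((ifname, c, delta))
    return alerts

# Illustrative poll data: CRC errors climbing on Gi1/1 -- bad cable or
# dying transceiver, invisible in any config diff.
prev = {"Gi1/1": {"in_errors": 0, "crc": 2}}
curr = {"Gi1/1": {"in_errors": 0, "crc": 150}}
print(degraded(prev, curr))  # -> [('Gi1/1', 'crc', 148)]
```

This complements, rather than replaces, periodic failover testing: it catches the slow rot, while the test catches the failure modes nobody anticipated.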
Re: Data Center testing
Most provider-type datacenters I've worked with get a lot of flak from customers when they announce they're doing network failover testing, because there's always going to be a certain amount of chance (at least) of disruption. It's the exception to find a provider that does it, I think (or maybe just one that admits it when they're doing it). Power tests are a different thing.

As for testing your own equipment, there are a couple ways to do that: regular failover tests (quarterly, or more likely at 6 month intervals), and/or routing traffic so that you have some of your traffic on all paths (ie internal traffic on one path, external traffic on another). The latter doesn't necessarily tell you that your failover will work perfectly, only that all your gear in the 2nd path is functioning. I prefer doing both.

When doing the failover tests, no matter how good your setup is, there's always a chance of taking a hit, so I always do this kind of work during a maintenance window, not too close to quarter end, etc. If you have your equipment set up correctly of course, it goes like butter and is a total non-event.

For test procedure, I usually pull cables. I'll go all the way to line cards or power cables if I really want to test, though that can be hard on equipment.

E

On Mon, Aug 24, 2009 at 10:45 AM, Jack Bates wrote:
> Dan Snyder wrote:
>> We have done power tests before and had no problem. I guess I am looking
>> for someone who does testing of the network equipment outside of just
>> power tests. We had an outage due to a configuration mistake that became
>> apparent when a switch failed. It didn't cause a problem however when we
>> did a power test for the whole data center.
>
> The plus side of failure testing is that it can be controlled. The downside
> to failure testing is that you can induce a failure. Maintenance windows are
> cool, but some people really dislike failures of any type, which limits how
> often you can test.
> I personally try for once a year. However, a lot can go
> wrong in a year.
>
> Jack
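The cable-pull procedure above benefits from an automated pre/post check: snapshot what's reachable before the pull, snapshot again after failover, and diff. How the snapshot of prefixes is collected is an assumption (SNMP, NETCONF, or scraping "show ip route"); the comparison itself is trivial:

```python
def compare_snapshots(before, after):
    """Diff two sets of reachable prefixes taken around a failover
    test. Anything in 'lost' means the failover path is not carrying
    everything the primary did; 'gained' flags unexpected new state."""
    lost = sorted(before - after)
    gained = sorted(after - before)
    return lost, gained

# Illustrative snapshots (in practice, collected from the router
# before pulling the cable and again once the failover settles).
before = {"10.0.0.0/8", "192.0.2.0/24", "198.51.100.0/24"}
after = {"10.0.0.0/8", "192.0.2.0/24"}

lost, gained = compare_snapshots(before, after)
if lost:
    print("prefixes missing after failover:", lost)
```

If both lists are empty, the test "goes like butter"; if not, you have a precise list of what to chase before closing the maintenance window.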