On Mon, 1 Feb 2010 21:21:52 -0500 Chadwick Sorrell <mirot...@gmail.com> wrote:
> Hello NANOG,
>
> Long time listener, first time caller.
>
> A recent organizational change at my company has put someone in charge
> who is determined to make things perfect. We are a service provider,
> not an enterprise company, and our business is doing provisioning work
> during the day. We recently experienced an outage when an engineer,
> troubleshooting a failed turn-up, changed the ethertype on the wrong
> port losing both management and customer data on said device. This
> isn't a common occurrence, and the engineer in question has a pristine
> track record.
>

Why didn't the customer have a backup link if their service was so
important to them and, indirectly, to your upper management? If your
upper management are taking this problem that seriously, then your
*sales people* didn't do their job properly - they should be ensuring
that customers with high availability requirements have a backup link,
or aren't led to believe that a single-point-of-failure service will be
highly available.

> This outage, of a high profile customer, triggered upper management to
> react by calling a meeting just days after. Put bluntly, we've been
> told "Human errors are unacceptable, and they will be completely
> eliminated. One is too many."
>

If upper management don't understand that human error is a risk factor
that can't be completely eliminated, then I suggest "self-eliminating"
and finding yourself a job somewhere else. The only way you'll avoid
human error having any impact on production services is to not change
anything - which pretty much means not having a job anyway ...

> I am asking the respectable NANOG engineers....
>
> What measures have you taken to mitigate human mistakes?
>
> Have they been successful?
>
> Any other comments on the subject would be appreciated, we would like
> to come to our next meeting armed and dangerous.
>
> Thanks!
> Chad
>