RearOfSignal wrote:
spidey3 wrote:
...But this was an IT failure, and as an IT pro with 25 years' experience running mission-critical systems, I can state with full confidence that this is my territory.
And based upon that experience I can tell you that the fact that the computers went down at all is a major failure on the part of the IT hardware folks, and a black eye for their managers.
And how have you and/or your employer dealt with incidents such as we had last night?
There have been a few, but not many. It should be possible to switch power supplies, or even switch from a primary to a backup server, without interrupting service. In most cases we have been able to maintain that level of service. The key is to plan for outages, have redundant systems in place so that outages are kept localized to the system with the fault, and have documented procedures for dealing with outages (planned or unplanned).
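The failover idea described above — keep redundant systems in place so a fault stays localized and service continues on a backup — can be sketched roughly as follows. This is a minimal, hypothetical illustration, not any real system; the `Backend` class and `serve_request` function are invented for the example.

```python
# Hypothetical sketch of the failover pattern described above: a request
# is tried against an ordered list of redundant backends, so a fault in
# the primary stays localized and the backup keeps service running.

class Backend:
    """Illustrative stand-in for a redundant server or power supply."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def serve_request(request, backends):
    """Try each backend in priority order; fail over on error."""
    errors = []
    for backend in backends:
        try:
            return backend.handle(request)
        except ConnectionError as exc:
            errors.append(str(exc))  # record the fault, try the next backend
    # Only reached if every redundant system failed at once.
    raise RuntimeError("total outage: " + "; ".join(errors))


primary = Backend("primary", healthy=False)  # simulate the primary failing
backup = Backend("backup")
print(serve_request("GET /status", [primary, backup]))
# prints: backup served GET /status
```

The point of the sketch is the same as in the post: the caller never sees the primary's fault, and an interruption occurs only when every redundant path fails together — which is exactly the combination-of-failures case the follow-up investigation should probe.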
In a few cases, I have experienced outages due to redundant systems which didn't work as expected in the face of a failure, or which caused an interruption of service during maintenance. Some of those events have had wider-reaching consequences. In those cases, the follow-up is to investigate not only the proximal cause of the problem, but also the root causes: Was a procedure not followed correctly? Were shortcuts taken which should not have been? Were the proper procedures to avoid an outage documented correctly? Is the procedure too complicated? If the procedures were documented correctly and followed accurately, what deficiency in the procedures allowed the outage? Was there some failure mode which was not anticipated? Was a combination of failures not anticipated? Did management insist on a timeframe for the change which was too brief to allow following procedures carefully? Did budgetary pressures drive an inappropriate choice to reduce the amount of redundancy? Etc...
I truly hope that MN is looking not only at the proximal cause for this incident, but also at the procedural matters...