Discussion relating to the operations of MTA Metro-North Railroad, including west-of-Hudson operations, and discussion of CTDOT-sponsored rail operations such as Shore Line East and the Springfield to New Haven Hartford Line

Moderators: GirlOnTheTrain, nomis, FL9AC, Jeff Smith

  by RearOfSignal
 
spidey3 wrote:
RearOfSignal wrote:MNR has plenty of plans for an extended signal outage, but it was quicker to fix the problem downtown than to set up trains for a signal outage. MNR implemented such a plan shortly after Hurricane Irene destroyed tracks and signaling equipment.
No argument with that - but it is not what I am talking about.

I get that you all know rail operations from the inside, and in far greater detail than I do.
I concede that this is your territory.

But this was an IT failure, and as an IT pro with 25 years of experience running mission-critical systems, I can state with full confidence that this is my territory.

And based upon that experience I can tell you that the fact that the computers went down at all is a major failure on the part of the IT hardware folks, and a black eye for their managers.
And how have you and/or your employer dealt with incidents such as we had last night?
  by Tadman
 
Spidey, I think there's a difference between your company - probably a for-profit, privately owned company - and a commuter railroad. The commuter railroad funds half its operating costs and none of its capital costs from revenue; the rest comes from the perpetually broke gov't. Compare this with your company, which (I sure hope) funds every cost plus a 5-10% profit margin from revenue, affording it the opportunity to buy proper IT redundancy.
  by ThirdRail7
 
RearOfSignal wrote:
runningwithscalpels wrote:Third Rail, the OCC isn't very old at all.
It was just a couple of years ago that they upgraded the whole OCC.
Then Metro-North must have done some hyper-installation. I haven't been up there in a few months, but the last time I was, I remember seeing old catenary between CP244 and CP248, and I also remember seeing some between CP257 and CP261.
spidey3 wrote:
Ridgefielder wrote:If they indeed did not have a backup power supply for the OCC, that is absolutely a management failure, and heads should roll at 347 Madison.
From the latest info it appears that they do in fact have dual redundant power supplies, but that due to human error (either yesterday or at installation time) the redundancy was ineffective. My read is that they needed to replace the primary (probably they were seeing imminent-fault warnings, voltage drift, etc.) - but didn't verify that the alternate supply was properly connected and operational.

There are two major failings indicated:
1) Lack of regular testing [or insufficient testing plan] for power outage scenarios
2) Insufficient planning / prediction of potential breakdowns for the power supply replacement task

Both of those are indicative of poor management oversight.
Or it could be...as mentioned...simple human error. A few weeks ago, Amcrap, NJTrainslate, and Wrong Island Railroad had a terrible day because a contractor cut into several communication lines, causing PSCC to lose control of various interlockings. CSX lost control of all of its signals and had a complete code line failure as well. These things happen from time to time.
  by DutchRailnut
 
The OCC has nothing to do with catenary; it's the Operations Control Center, a.k.a. the dispatchers' room.
  by alewifebp
 
I'm in IT as well, and I chimed in with similar thoughts when the NJT outage occurred. I get what is being said about following proper procedures, and having a proper plan in place and the correct disaster recovery model functioning. I'm sure they will review procedures and make changes where necessary. But before we rake anyone over the coals, let's also remember something that happened just this last week. Several Google services were down for some time. This outage even resulted in some poor fellow receiving thousands of e-mail messages. One would assume that Google has a downright excellent IT department and nearly unlimited budget. They got it wrong. Similar problems have happened in the past to other tech titans, and will continue in the future.
  by Backshophoss
 
IF lirr42's blog is correct, a contractor "screw-up" created MN's "nightmare" shutdown that day.
Because of one forgotten wire connection that wasn't checked BEFORE the changeover to the alternate power source, you wind up with a total system failure. :(
"The Devil is in The Details"......
  by spidey3
 
RearOfSignal wrote:
spidey3 wrote:...But this was an IT failure, and as an IT pro with 25 years of experience running mission-critical systems, I can state with full confidence that this is my territory.

And based upon that experience I can tell you that the fact that the computers went down at all is a major failure on the part of the IT hardware folks, and a black eye for their managers.
And how have you and/or your employer dealt with incidents such as we had last night?
There have been a few, but not many. It should be possible to switch power supplies, or even switch from a primary to a backup server, without interrupting service. In most cases we have been able to maintain that level of service. The key is to plan for outages, have redundant systems in place so that outages are kept localized to the system with the fault, and have documented procedures for dealing with outages (planned or unplanned).

In a few cases, I have experienced outages due to redundant systems which didn't work as expected in the face of an outage, or which caused an interruption of service during maintenance. Some of those events have had wider-reaching consequences. In those cases, the follow-up is to investigate not only the proximate cause of the problem, but also the root causes: Was a procedure not followed correctly? Were shortcuts taken which should not have been? Were the proper procedures to avoid an outage documented correctly? Is the procedure too complicated? If the procedures were documented correctly and followed accurately, what deficiency in the procedures allowed the outage? Was there some failure mode which was not anticipated? Was a combination of failures not anticipated? Did management insist on a timeframe for the change which was too brief to allow following procedures carefully? Did budgetary pressures drive an inappropriate choice to reduce the amount of redundancy? Etc...

I truly hope that MN is looking not only at the proximate cause of this incident, but also at the procedural matters...
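
To make the point about keeping outages localized concrete, here is a minimal sketch of the sort of redundancy audit I have in mind. It is purely illustrative and written in Python only for brevity: the feed names, voltage thresholds, and telemetry readings are invented stand-ins, not anything MNR (or any real facility) actually exposes.

Code:

# Hypothetical sketch: treat "running on a single feed" as an incident in itself,
# rather than something discovered only when the second feed is lost.

SUPPLIES = {
    # Stand-in readings; a real audit would query PDU/BMS telemetry instead.
    "primary": {"voltage": 120.4, "online": True},
    "alternate": {"voltage": 0.0, "online": False},  # e.g. left disconnected
}

VOLTAGE_MIN, VOLTAGE_MAX = 114.0, 126.0  # assumed nominal 120 V feed, +/- 5%


def supply_healthy(name: str) -> bool:
    """True if the named feed is online and within voltage tolerance."""
    reading = SUPPLIES[name]
    return reading["online"] and VOLTAGE_MIN <= reading["voltage"] <= VOLTAGE_MAX


def audit_redundancy() -> list[str]:
    """Return alert messages; an empty list means full redundancy is intact."""
    return [
        f"ALERT: feed '{name}' is degraded or offline -- redundancy is lost"
        for name in SUPPLIES
        if not supply_healthy(name)
    ]


if __name__ == "__main__":
    alerts = audit_redundancy()
    for alert in alerts:
        print(alert)  # in practice: page the duty desk and open a ticket
    if not alerts:
        print("Both feeds healthy; full redundancy confirmed.")

Run on a schedule, a check like this means a disconnected alternate shows up as its own event long before the primary ever fails, which is the whole point of paying for the redundancy in the first place.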
  by spidey3
 
Backshophoss wrote:IF lirr42's blog is correct, a contractor "screw-up" created MN's "nightmare" shutdown that day.
Because of one forgotten wire connection that wasn't checked BEFORE the changeover to the alternate power source, you wind up with a total system failure. :(
Right -- this is a typical failure case when doing power supply maintenance. Usually the risk is mitigated by having thorough procedures / checklists to ensure that the alternate is operational before shutting off the primary.

The breakdowns usually occur when the procedures are inadequate (or inadequately documented), or when schedule pressure from management causes procedural steps to be skipped. Occasionally it is just sloppiness / laziness on the part of the actual worker -- but I find that this is rare...
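
To illustrate what I mean by procedures that gate the changeover, here is a rough sketch, again in Python purely for illustration. The individual checks and their wording are my own invention for the example, not drawn from any actual MNR or vendor procedure.

Code:

# Hypothetical sketch: every pre-check must pass before the procedure
# allows the primary supply to be taken offline.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    description: str
    passed: Callable[[], bool]  # returns True only if it is safe to proceed


def alternate_connected() -> bool:
    # Stand-in: in reality a physical inspection or a telemetry read
    return True


def alternate_holds_test_load() -> bool:
    # Stand-in: briefly shift a non-critical load and confirm the feed holds it
    return True


def ups_charged() -> bool:
    # Stand-in: confirm the battery backup can bridge the changeover window
    return True


PRECHECKS = [
    Check("Alternate feed physically connected and energized", alternate_connected),
    Check("Alternate feed carries a test load without fault", alternate_holds_test_load),
    Check("UPS charged and able to bridge the changeover", ups_charged),
]


def safe_to_disconnect_primary() -> bool:
    """Walk the checklist in order; refuse the irreversible step on any failure."""
    for check in PRECHECKS:
        ok = check.passed()
        print(f"[{'PASS' if ok else 'FAIL'}] {check.description}")
        if not ok:
            print("ABORT: leave the primary in service and escalate.")
            return False
    print("All pre-checks passed; the primary may be taken offline.")
    return True


if __name__ == "__main__":
    safe_to_disconnect_primary()

Writing the checklist down this way forces the question "did anyone actually verify the alternate?" to be answered before the irreversible step, which, if lirr42's account is right, is exactly the step that was missed.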
  by RearOfSignal
 
spidey3 wrote:
Backshophoss wrote:IF lirr42's blog is correct, a contractor "screw-up" created MN's "nightmare" shutdown that day.
Because of one forgotten wire connection that wasn't checked BEFORE the changeover to the alternate power source, you wind up with a total system failure. :(
Right -- this is a typical failure case when doing power supply maintenance. Usually the risk is mitigated by having thorough procedures / checklists to ensure that the alternate is operational before shutting off the primary.

The breakdowns usually occur when the procedures are inadequate (or inadequately documented), or when schedule pressure from management causes procedural steps to be skipped. Occasionally it is just sloppiness / laziness on the part of the actual worker -- but I find that this is rare...
So then, for all you know, it is entirely possible that MNR has an effective plan in place for testing and maintaining such equipment so as not to cause failures like the one we had Thursday. It might not be management at fault, but rather an employee who didn't follow the procedure for the outage. Further, we do not know the status of the equipment prior to the failure; its condition may have necessitated doing the work at such an inconvenient time as the evening rush. Let's just be fair before we start saying how incompetent MTA management is.
  by spidey3
 
RearOfSignal wrote:So then, for all you know, it is entirely possible that MNR has an effective plan in place for testing and maintaining such equipment so as not to cause failures like the one we had Thursday. It might not be management at fault, but rather an employee who didn't follow the procedure for the outage. Further, we do not know the status of the equipment prior to the failure; its condition may have necessitated doing the work at such an inconvenient time as the evening rush. Let's just be fair before we start saying how incompetent MTA management is.
Yes - that is entirely possible - but I can tell you that in my experience, if a problem happens due to labor or contractor error, usually management takes at least some of the blame. Sometimes this is warranted, other times less so...
  by Backshophoss
 
Figure on a long search through the maintenance/test/construction records to find a reason.
  by EM2000
 
Or it could be...as mentioned...simple human error. A few weeks ago, Amcrap, NJTrainslate, and Wrong Island Railroad had a terrible day because a contractor cut into several communication lines, causing PSCC to lose control of various interlockings. CSX lost control of all of its signals and had a complete code line failure as well. These things happen from time to time.
The incident you are referring to occurred back in the fall. PSCC only lost control of F. If I recall correctly, only 32/33 bridge, F's easterly home signals at the time, were dark, affecting westbound trains on Lines 4 and 2 during the afternoon/evening rush. BTW, how was CSX affected by a signal failure in F that day? None of this is relevant to the MN forum, though.
  by ThirdRail7
 
EM2000 wrote:
Or it could be...as mentioned...simple human error. A few weeks ago, Amcrap, NJTrainslate, and Wrong Island Railroad had a terrible day because a contractor cut into several communication lines, causing PSCC to lose control of various interlockings. CSX lost control of all of its signals and had a complete code line failure as well. These things happen from time to time.
The incident you are referring to occurred back in the fall. PSCC only lost control of F. If I recall correctly, only 32/33 bridge, F's easterly home signals at the time, were dark, affecting westbound trains on Lines 4 and 2 during the afternoon/evening rush. BTW, how was CSX affected by a signal failure in F that day? None of this is relevant to the MN forum, though.
I'm thinking of an incident that occurred in the morning; it impacted F, part of Harold, and communications at Q and R. At no point did I say or mean to imply it affected CSX, but they did have a code line failure the same day. As for relevance, I listed them as examples of human error and of things "just happening," since I personally do not believe such incidents necessarily mean there is poor management oversight, nor do I believe that all of these recent events are indicative of poor management and connected in some way. Even with the best-laid plans, sometimes things just happen.

Look how I mangled OCC above. I read it correctly but flipped it with OCS and responded as such. Simple human error. Thanks for the corrections.
  by EM2000
 
My apologies, ThirdRail. I don't work mornings and misunderstood what you meant by your CSX example.