All service stopped

Discussion relating to the operations of MTA MetroNorth Railroad including west of Hudson operations and discussion of CtDOT sponsored rail operations such as Shore Line East and the Springfield to New Haven Hartford Line

Moderators: GirlOnTheTrain, nomis, Jeff Smith, FL9AC

RearOfSignal
Posts: 2553
Joined: Tue Aug 09, 2005 2:31 pm

Re: All service stopped

Post by RearOfSignal » Fri Jan 24, 2014 9:46 pm

spidey3 wrote:
RearOfSignal wrote:MNR has plenty of plans for an extended signal outage, but it was quicker to fix the problem downtown than to set up trains for a signal outage. MNR implemented such a plan shortly after Hurricane Irene destroyed tracks and signaling equipment.
No argument with that - but it is not what I am talking about.

I get that you all know rail operations from the inside, and with far greater detail than I do.
I concede that this is your territory.

But this was an IT failure, and as an IT pro with 25 years' experience running mission-critical systems, I can state with full confidence that this is my territory.

And based upon that experience I can tell you that the fact that the computers went down at all is a major failure on the part of the IT hardware folks, and a black eye for their managers.
And how have you and/or your employer dealt with incidents such as we had last night?
Hurry up and wait at the signal!

Tadman
Posts: 9548
Joined: Wed Sep 01, 2004 10:21 am

Re: All service stopped

Post by Tadman » Fri Jan 24, 2014 9:56 pm

Spidey, I think there's a difference between your company - probably a for-profit privately owned company - and a commuter railroad. The commuter railroad funds half their operating costs and none of their capital costs from revenue. The rest is from the perpetually broke gov't. Compare this with your company which (I sure hope) funds every cost plus 5-10% profit margin from revenue, affording them the opportunity to buy proper IT redundancy.
Dig the new rr.net Instagram account: @railroad_dot_net

ThirdRail7
Posts: 4090
Joined: Sat Oct 01, 2011 7:07 pm

Re: All service stopped

Post by ThirdRail7 » Sat Jan 25, 2014 6:58 pm

RearOfSignal wrote:
runningwithscalpels wrote:Third Rail, the OCC isn't very old at all.
It was just a couple of years ago that they upgraded the whole OCC.
Then Metro-North must have done some hyper-installation. I haven't been up there in a few months but the last time I was, I remember seeing old catenary between CP244 and CP248 and I also remember seeing some between CP257 and CP261.
spidey3 wrote:
Ridgefielder wrote:If they indeed did not have a backup power supply for the OCC, that is absolutely a management failure, and heads should roll at 347 Madison.
From the latest info it appears that they do in fact have dual redundant power supplies, but that due to human error (either yesterday or at installation time) the redundancy was ineffective. My read is that they needed to replace the primary (probably were seeing imminent fault warning, voltage drift, etc.) - but didn't verify that the alternate supply was properly connected and operational.

There are two major failings indicated:
1) Lack of regular testing [or insufficient testing plan] for power outage scenarios
2) Insufficient planning / prediction of potential breakdowns for the power supply replacement task

Both of those are indicative of poor management oversight.
Or it could be...as mentioned...simple human error. A few weeks ago, Amcrap, NJTrainslate and Wrong Island Railroad had a terrible day because a contractor cut into several communication lines, causing PSCC to lose control of various interlockings. CSX lost control of all of their signals and had a complete code line failure as well. These things happen from time to time.
I want my road foreman!

DutchRailnut
Posts: 22257
Joined: Thu Mar 11, 2004 8:02 pm
Location: released from Stalag 13

Re: All service stopped

Post by DutchRailnut » Sat Jan 25, 2014 7:06 pm

OCC has nothing to do with catenary; it is the Operations Control Center, aka the dispatchers' room.
If Conductors are in charge, why are they promoted to be Engineer???

Retired Triebfahrzeugführer. I am not a moderator.

alewifebp
Posts: 1021
Joined: Wed Jul 21, 2004 11:03 pm
Location: WORMland

Re: All service stopped

Post by alewifebp » Mon Jan 27, 2014 1:21 am

I'm in IT as well, and I chimed in with similar thoughts when the NJT outage occurred. I get what is being said about following proper procedures, and having a proper plan in place and the correct disaster recovery model functioning. I'm sure they will review procedures and make changes where necessary. But before we rake anyone over the coals, let's also remember something that happened just this last week. Several Google services were down for some time. This outage even resulted in some poor fellow receiving thousands of e-mail messages. One would assume that Google has a downright excellent IT department and nearly unlimited budget. They got it wrong. Similar problems have happened in the past to other tech titans, and will continue in the future.

Backshophoss
Posts: 6318
Joined: Mon Mar 05, 2012 7:58 pm

Re: All service stopped

Post by Backshophoss » Mon Jan 27, 2014 2:59 am

IF lirr42's blog is correct, a contractor "screw-up" created the MN's "nightmare" shutdown that day.
For a forgotten wire connection not checked BEFORE the change over to the alt power source, you wind up
with a total system failure. :(
"The Devil is in The Details"......
The Land of Enchantment is not Flyover country!

spidey3
Posts: 97
Joined: Fri Jun 04, 2010 12:11 pm

Re: All service stopped

Post by spidey3 » Mon Jan 27, 2014 10:44 am

RearOfSignal wrote:
spidey3 wrote:...But this was an IT failure, and as an IT pro with 25 years' experience running mission-critical systems, I can state with full confidence that this is my territory.

And based upon that experience I can tell you that the fact that the computers went down at all is a major failure on the part of the IT hardware folks, and a black eye for their managers.
And how have you and/or your employer dealt with incidents such as we had last night?
There have been a few, but not many. It should be possible to switch power supplies, or even switch from a primary to a backup server, without interrupting service. In most cases we have been able to maintain that level of service. The key is to plan for outages, have redundant systems in place so that outages are kept localized to the system with the fault, and have documented procedures for dealing with outages (planned or unplanned).

In a few cases, I have experienced outages due to redundant systems which didn't work as expected in the face of an outage, or which caused an interruption of service during maintenance. Some of those events have had wider-reaching consequences. In those cases, the follow-up is to investigate not only the proximal cause of the problem, but also the root causes: Was a procedure not followed correctly? Were shortcuts taken which should not have been? Were the proper procedures to avoid an outage documented correctly? Is the procedure too complicated? If the procedures were documented correctly and followed accurately, what deficiency in the procedures allowed the outage? Was there some failure mode which was not anticipated? Was a combination of failures not anticipated? Did management insist on a timeframe for the change which was too brief to allow following procedures carefully? Did budgetary pressures drive an inappropriate choice to reduce the amount of redundancy? Etc...

I truly hope that MN is looking not only at the proximal cause for this incident, but also at the procedural matters...

spidey3
Posts: 97
Joined: Fri Jun 04, 2010 12:11 pm

Re: All service stopped

Post by spidey3 » Mon Jan 27, 2014 10:56 am

Backshophoss wrote:IF lirr42's blog is correct, a contractor "screw-up" created the MN's "nightmare" shutdown that day.
For a forgotten wire connection not checked BEFORE the change over to the alt power source, you wind up
with a total system failure. :(
Right -- this is a typical failure case when doing power supply maintenance. Usually the risk is mitigated by having thorough procedures / checklists to ensure that the alternate is operational before shutting off the primary.

The breakdowns usually occur when the procedures are inadequate (or inadequately documented), or when schedule pressure from management causes skipping of procedural steps. Occasionally it is just sloppiness / laziness on the part of the actual worker -- but I find that this is rare...

RearOfSignal
Posts: 2553
Joined: Tue Aug 09, 2005 2:31 pm

Re: All service stopped

Post by RearOfSignal » Mon Jan 27, 2014 12:32 pm

spidey3 wrote:
Backshophoss wrote:IF lirr42's blog is correct, a contractor "screw-up" created the MN's "nightmare" shutdown that day.
For a forgotten wire connection not checked BEFORE the change over to the alt power source, you wind up
with a total system failure. :(
Right -- this is a typical failure case when doing power supply maintenance. Usually the risk is mitigated by having thorough procedures / checklists to ensure that the alternate is operational before shutting off the primary.

The breakdowns usually occur when the procedures are inadequate (or inadequately documented), or when schedule pressure from management causes skipping of procedural steps. Occasionally it is just sloppiness / laziness on the part of the actual worker -- but I find that this is rare...
So then it is entirely possible, for all you know, that MNR has an effective plan in place for testing and maintaining such equipment so as not to cause failures like the one we had Thursday. So management might not be at fault; it could just be an employee who didn't follow the procedure. Further, we do not know the status of the equipment prior to the failure; its condition may have necessitated the work being done at such an inconvenient time as the evening rush. Let's just be fair before we start saying how incompetent MTA management is.
Hurry up and wait at the signal!

spidey3
Posts: 97
Joined: Fri Jun 04, 2010 12:11 pm

Re: All service stopped

Post by spidey3 » Mon Jan 27, 2014 12:54 pm

RearOfSignal wrote:So then it is entirely possible, for all you know, that MNR has an effective plan in place for testing and maintaining such equipment so as not to cause failures like the one we had Thursday. So management might not be at fault; it could just be an employee who didn't follow the procedure. Further, we do not know the status of the equipment prior to the failure; its condition may have necessitated the work being done at such an inconvenient time as the evening rush. Let's just be fair before we start saying how incompetent MTA management is.
Yes - that is entirely possible - but I can tell you that in my experience, if a problem happens due to labor or contractor error, usually management takes at least some of the blame. Sometimes this is warranted, other times less so...

Backshophoss
Posts: 6318
Joined: Mon Mar 05, 2012 7:58 pm

Re: All service stopped

Post by Backshophoss » Tue Jan 28, 2014 1:31 am

Figure on a long search thru the maintenance/test/construction records to find a reason.
The Land of Enchantment is not Flyover country!

EM2000
Posts: 196
Joined: Mon Jun 06, 2011 8:43 pm

Re: All service stopped

Post by EM2000 » Tue Jan 28, 2014 4:13 am

Or it could be...as mentioned...simple human error. A few weeks ago, Amcrap, NJTrainslate and Wrong Island Railroad had a terrible day because a contractor cut into several communication lines, causing PSCC to lose control of various interlockings. CSX lost control of all of their signals and had a complete code line failure as well. These things happen from time to time.
The incident you are referring to occurred back in the fall. PSCC only lost control of F. If I recall correctly, only 32/33 bridge, F's easterly home signals at the time, were dark, affecting westbound trains on Lines 4 and 2 during the afternoon/evening rush. BTW, how was CSX affected by a signal failure in F that day? None of this is relevant to the MN forum though.

ThirdRail7
Posts: 4090
Joined: Sat Oct 01, 2011 7:07 pm

Re: All service stopped

Post by ThirdRail7 » Tue Jan 28, 2014 6:58 am

EM2000 wrote:
Or it could be...as mentioned...simple human error. A few weeks ago, Amcrap, NJTrainslate and Wrong Island Railroad had a terrible day because a contractor cut into several communication lines, causing PSCC to lose control of various interlockings. CSX lost control of all of their signals and had a complete code line failure as well. These things happen from time to time.
The incident you are referring to occurred back in the fall. PSCC only lost control of F. If I recall correctly, only 32/33 bridge, F's easterly home signals at the time, were dark, affecting westbound trains on Lines 4 and 2 during the afternoon/evening rush. BTW, how was CSX affected by a signal failure in F that day? None of this is relevant to the MN forum though.
I'm thinking of an incident that occurred in the morning, and it impacted F and part of Harold, as well as communications at Q and R. At no point did I say or mean to imply it affected CSX, but they did have a code line failure the same day. As for relevance, I listed them as examples of human error and things "just happening" since I personally do not believe such incidents necessarily mean there is poor management oversight, nor do I believe that all of these recent events are indicative of poor management and are connected in some way. Even with the best laid plans, sometimes things just happen.

Look how I mangled OCC above. I read it correctly but flipped it with OCS and responded as such. Simple human error. Thanks for the corrections.
I want my road foreman!

EM2000
Posts: 196
Joined: Mon Jun 06, 2011 8:43 pm

Re: All service stopped

Post by EM2000 » Wed Jan 29, 2014 3:44 am

My apologies, ThirdRail. I don't work mornings and misunderstood what you meant by your CSX example.
