Databus Issue: 2003 4 10/02/2003

Fault Tolerance: An organizational perspective

Mike Caskey, CETPA President
To find a fault is easy; to do better may be difficult.

It is highly likely that everyone reading this article (I won’t call it a diatribe just yet) has experienced the “blue screen of death.” And it probably occurred while you were trying to do something important. And who hasn’t had the pleasure of a “so-sad-too-bad” phone call? You know… “Gee, I’m sorry, but the person you need to speak with is on vacation. Please call back next year.” According to webopedia.com (www.webopedia.com/TERM/fault_tolerance.html), fault tolerance is:

The ability of a system to respond gracefully to an unexpected hardware or software failure. There are many levels of fault tolerance, the lowest being the ability to continue operation in the event of a power failure. Many fault-tolerant computer systems mirror all operations -- that is, every operation is performed on two or more duplicate systems, so if one fails the other can take over.

The first example above falls within the boundaries of this definition: it is a potentially recoverable hardware or software failure. The second example, being an organizational failure, would appear to fall outside the definition’s purview. I believe it should fall within it.

At one of the CEDPA conferences a few years ago, Gopal Kapur presented an entertaining and thought-provoking address entitled “System, system, who’s got the system?” One of the presentation themes that grabbed my attention was that a system is made up of many components, not just hardware and software. Other components include the users of the system, the “keepers” of the system, the paths of interaction and communication between these components, and the application rules, such as GAAP for accounting systems. If a system is to remain responsive under any and all conditions, shouldn’t all components be fault tolerant?

Hardware and software engineers alike are working 24/7 to produce products that are increasingly fault tolerant. You can put together a computer system that will withstand “hits” from a lot of different sources and continue to function. How about your organization? Do you have people in the organization without whom certain functions will cease? If so, are these functions critical to the mission of your organization? Are there redundant paths of interaction and communication throughout your systems? How many in your organization know the application rules for the various systems in place?

In today’s socio-political climate in California, systems that are responsive and accurate will help reduce the negative impressions of those who must interact with them. Face it: bad press, disgruntled employees, or anyone “with an axe to grind” can place the organization under microscopic scrutiny and cause major disruptions.

So, how do we mitigate such reactions? Dust off your systems analysis skills and develop a workflow diagram of your various systems, especially the critical systems and those with which the public may come into contact. Identify the nodes without built-in redundancy. These nodes may be hardware, software, communications paths, internal customers, external users, or I.T. staff. Develop ways to handle the failure of each vulnerable node. Provide technical and application training for the staff. Put your recently retired equipment back to work to provide system redundancy. Cross-train the I.T. staff. Expensive? Possibly. Worthwhile? Absolutely!!
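For readers who would rather start from a list than a diagram, the "identify the nodes without built-in redundancy" step can be roughed out in a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `find_single_points_of_failure`, the `coverage` inventory, and every person, machine, and function named in it are hypothetical, made up for illustration. It assumes you can write down, for each function your organization performs, which nodes (people or equipment) are able to carry it out.

```python
from collections import defaultdict


def find_single_points_of_failure(coverage, critical=None):
    """Return {node: [functions that stop if this node fails]}.

    coverage maps each function to the nodes able to perform it.
    A function covered by exactly one node has no built-in redundancy.
    If critical is given, only those functions are examined.
    """
    at_risk = defaultdict(list)
    for function, nodes in coverage.items():
        if critical is not None and function not in critical:
            continue
        unique_nodes = set(nodes)
        if len(unique_nodes) == 1:
            (only_node,) = unique_nodes
            at_risk[only_node].append(function)
    return {node: sorted(funcs) for node, funcs in at_risk.items()}


# Hypothetical inventory, for illustration only.
coverage = {
    "payroll run": ["Pat"],
    "student records server": ["server A", "server B"],
    "help desk phone": ["Pat", "Lee"],
    "nightly backup": ["tape drive"],
    "GAAP month-end close": ["Pat"],
}

report = find_single_points_of_failure(coverage)
for node, functions in sorted(report.items()):
    print(f"{node}: no backup for {', '.join(functions)}")
```

With this made-up inventory, the report flags "Pat" (payroll and the month-end close) and the tape drive, which are exactly the places where cross-training or a retired spare would buy some redundancy. A real inventory would also need to capture the communication paths between nodes, which this sketch leaves out.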

Years ago, most stereo buffs built their systems from components. Such a system was only as strong as its weakest component. All systems we deal with are made up of many components, and the failure of any one component may mean the failure of the system. The definition of “Fault Tolerance” should be “the ability of a system to respond gracefully to an unexpected failure.” We can make that happen by building and operating Fault Tolerant Organizations.

