Databus Issue: 2003 4 10/02/2003

Fault Tolerance in Information System Design

Al Foytek Information Systems

Fault tolerance is a technique used to increase the “dependability” of a system, where dependability is the probability that the system will be both available when needed and reliable for as long as needed (1).

Many non-critical systems do not need a lot of “dependability”. If only a few people use the system and the timelines for its use have a lot of slack in them, then, indeed, it does not need to have high dependability.

“Fault tolerant” describes a system design that tolerates faults. It does not mean a system that never has faults. A system that never fails is perfectly dependable and needs no fault tolerance. I have not found such a system yet, but you get the idea. A system that has faults but keeps on working is fault tolerant.

High dependability is needed for critical systems. Critical systems obviously include any involving safety, such as air traffic control, emergency management, and navigation. Critical business information systems are those with relatively high importance to the school district’s mission, i.e., educating children. Systems with a large number of users, systems that support time-critical functions, and systems that do important tasks for the district would be considered critical. Criticality might be measured in terms of the number of instructional hours likely to be lost, potential economic losses (labor or fines), or the potential for harmful or illegal actions, such as releasing a child to someone without proper custodial authority. Some examples of critical systems are:

• Instructional systems required to meet curriculum requirements and shared by many students.
• Systems used to track student attendance, demographics, emergency information, who has custodial authority, etc. (i.e., the student information system).
• Systems used for financial budgeting, accounting, payroll, purchasing, and so on (i.e., the financial information system).
• Systems used for employee position tracking, demographics, job information, and so on (i.e., the human resources system).
• District-wide systems used to provide basic services for communications, information retrieval, and archiving like email/calendaring, Internet service and instructional media retrieval.
• District-wide area network and large local area networks used as communications infrastructure for information systems and services.

By understanding the need for dependability, the system designer can plan for this early in system development with appropriate concepts, like fault tolerance.

System designers are often faced with opposing requirements, such as “Keep cost to a minimum, but it must be dependable 99.9 percent of the time.” High dependability and low cost are normally what every end user of a system expects.
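To see what a figure like 99.9 percent asks for, it helps to turn it into hours. The sketch below assumes the percentage is read as availability over a full calendar year of 8,760 hours; the function name and the yearly period are illustrative choices, not part of any requirement quoted above.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted over the period.

    Assumes the percentage is an availability target, i.e. the share
    of the period during which the system must be usable.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% available -> {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99.9 percent leaves about 8.76 hours of downtime per year. A system needed only during school hours could pass a smaller `period_hours` to get a budget for that window instead.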

What are system faults? Hardware failures are faults and, of course, software errors (bugs) are faults. But how about:
• Bad documentation – errors, undocumented features, wrong version information, on-line but server is usually down.
• Incorrect or inadequate operations procedures -- the ones you write.
• Supporting environment infrastructure – electrical, air conditioning, facility (roof leaks), etc.
• Untrained or poorly trained operator – causes system faults by taking inappropriate actions.
• Poor or untrained support – “it can be fixed but no one knows how” and/or “we did not buy vendor support but need help to upgrade.”
• Malicious acts possibly allowed through “security flaws” – in procedures, system capabilities, system configuration (did not activate adequate features).
• Deficiencies in design or overuse – e.g. performance failure due to saturation from overload. The cause may be unanticipated growth in use (a requirements fault), poor interpretation of requirements (a specification fault), or poor system implementation (a design fault).

Faults in our information systems can come from many places. Modern information systems include hardware, software, the facility and its supporting infrastructure, the network (communications), support personnel, technical data, and the operators. Figure 1 illustrates these relationships.


An exhaustive treatment of how to make each of these information system components fault tolerant is beyond the scope of this paper. However, we will touch on some of the obvious and not so obvious ways to increase information system fault tolerance below.

Computer Systems and Network Hardware Fault Tolerance – A common method of providing fault tolerance in hardware is redundancy (see Figure 2 below): dual power supplies, dual fans, multiple disk drives, and so on. This is a good method and works well to hold cost down while pushing dependability up. However, redundancy may not always be effective.
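The benefit of redundancy can be estimated with a little arithmetic. The sketch below assumes identical units that fail independently of one another, which is a simplifying assumption and exactly the one that breaks down in the cases discussed next. The function name and the sample availability of 0.99 are illustrative only.

```python
def parallel_availability(unit_availability: float, copies: int) -> float:
    """Probability that at least one of several redundant units is working.

    Assumes every unit has the same availability and that the units
    fail independently of each other.
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only when every copy is down at the same time.
    return 1.0 - (1.0 - unit_availability) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} power supplies -> {parallel_availability(0.99, n):.6f}")
```

With these numbers, a second power supply moves the estimate from 0.99 to 0.9999. The gain shrinks quickly when the units share a common cause of failure, because the independence assumption no longer holds.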

Some hardware faults can be predicted from a count of operating cycles. One example is the laser printer. Many laser printer parts wear with each page printed, and failure of these components is easily predicted by the number of pages printed (page print cycles or duty cycles). Thus, if you bought two printers and ran them equally, they could be expected to fail at about the same time, leaving you with no printing capability! The solution here is procedural: either stagger use or save one printer as the backup.
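The printer example can be put into numbers. The sketch below assumes a part wears out at a fixed rated page count and that the monthly print volume is split between two printers in a constant proportion. The rated life of 200,000 pages and the volume of 10,000 pages per month are made-up figures for illustration, and the sketch does not model the extra load on the survivor once the first printer is down.

```python
def months_until_wearout(rated_pages: int, pages_per_month: float) -> float:
    """Months until a printer reaches its rated page count."""
    if pages_per_month <= 0:
        return float("inf")  # an idle backup never wears out in this model
    return rated_pages / pages_per_month


def wearout_schedule(rated_pages: int, monthly_volume: float,
                     share_to_first: float) -> tuple:
    """Return (months for printer A, months for printer B).

    share_to_first is the fraction of the monthly volume sent to
    printer A; the remainder goes to printer B.
    """
    if not 0.0 <= share_to_first <= 1.0:
        raise ValueError("share_to_first must be between 0 and 1")
    a = months_until_wearout(rated_pages, monthly_volume * share_to_first)
    b = months_until_wearout(rated_pages, monthly_volume * (1.0 - share_to_first))
    return a, b


if __name__ == "__main__":
    for share in (0.5, 0.8, 1.0):
        a, b = wearout_schedule(200_000, 10_000, share)
        print(f"share to A = {share:.0%}: A at {a:.1f} months, B at {b:.1f} months")
```

In this model an even split wears out both printers in the same month, while an 80/20 split or a dedicated backup separates the two wear-out dates by years.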


A second method of providing fault tolerance is to provide for degraded operation or alternate “modes”, where a fault does not cause the system to fail completely. An example is a system that lets a component, such as a portion of memory, fail but continues to function with reduced performance.
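The memory example can be sketched as a small model. The class below is hypothetical: it tracks a set of equal-sized memory banks, takes failed banks out of service, and reports whether the system is running normally, degraded, or failed. Real hardware does this in firmware or the operating system; the names and sizes here are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryPool:
    """Hypothetical pool of equal-sized memory banks."""
    bank_size_mb: int
    bank_count: int
    failed: set = field(default_factory=set)

    def mark_failed(self, bank: int) -> None:
        """Take one bank out of service instead of stopping the system."""
        if not 0 <= bank < self.bank_count:
            raise IndexError("no such bank")
        self.failed.add(bank)

    @property
    def usable_mb(self) -> int:
        """Capacity still available from the banks that have not failed."""
        return self.bank_size_mb * (self.bank_count - len(self.failed))

    @property
    def mode(self) -> str:
        """'normal', 'degraded' (some banks lost), or 'failed' (all lost)."""
        if not self.failed:
            return "normal"
        if len(self.failed) < self.bank_count:
            return "degraded"
        return "failed"


if __name__ == "__main__":
    pool = MemoryPool(bank_size_mb=256, bank_count=4)
    pool.mark_failed(2)
    print(pool.mode, pool.usable_mb)
```

The design point is that a single fault changes the mode and the capacity, not whether the system runs at all.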

Software Fault Tolerance – Unlike hardware, redundancy in software usually does not increase fault tolerance, since software faults are caused by coding errors and the same coding errors are present in every copy of a software product sold. One possible way around this is to have different teams develop the same software without any collaboration. This approach is used for critical space shuttle software. It helps, but does not provide certainty, since humans tend to err in the same areas. For information system managers, a way to implement this approach is to purchase the same type of software from multiple, competing vendors. Unfortunately this is often not practical due to cost, training, and other issues. Proper research to obtain a high-quality product with good vendor support is often the best hedge against software faults.
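One common way to combine independently developed versions is to run them all and accept the answer that a majority agrees on. The sketch below illustrates that idea only; it is not a description of how the space shuttle software or any vendor product works, and the three `square_*` functions are hypothetical stand-ins for separately written versions, one of which contains a bug.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on x and return the answer most of them agree on."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            # A version that crashes simply loses its vote.
            continue
    if not results:
        raise RuntimeError("all versions failed")
    answer, votes = Counter(results).most_common(1)[0]
    # Require agreement from more than half of all versions.
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority agreement")
    return answer


def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_buggy(x: int) -> int:
    return x * 2  # deliberate coding error for the demonstration


if __name__ == "__main__":
    print(majority_vote([square_a, square_b, square_buggy], 5))
```

Voting masks the single faulty version here, but it offers no protection when most versions share the same mistake: `majority_vote([square_a, square_buggy, square_buggy], 5)` returns the wrong answer with a clear majority.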

Fortunately, software errors often do not cause total failure of a system. There may even be an alternative feature that can perform the same function or nearly so. These are more examples of degraded or alternate mode operation.

Personnel Fault Tolerance – There is no substitute for smart people with good training. This applies equally to support and operations personnel. Select them carefully, encourage them to grow professionally and keep them trained. For critical systems you need adequate cross-training (redundancy) for times when one person is on vacation, one gets sick, and another’s car breaks down.

Technical Data – Technical data includes manufacturer information as well as district-developed procedures and information. Procedures can determine whether or not a malicious person is blocked from accessing your system. Procedures (system configuration) can also determine whether or not important, fault-reducing features are used. For example, configuring a server to use the NTFS5 file system will give it encryption and better security features, while selecting an earlier file system will not.

Facility – Critical systems need protection from fire, water, electrical surges/sags, overheating, and physical abuse. Raised floors with water leak detection are still appropriate for some systems. Fire suppression systems that do not damage electronic equipment should be provided. Power conditioning and short-term uninterruptible power supplies are always needed for critical systems and, in some cases, backup generators may be appropriate. Air conditioning is required for most components, and heat sensors with alarms may be needed.

Security – Good security is needed for all critical systems. Security involves people, procedures, system function, configuration, location of components, and operation. Security flaws often result in information system faults, sometimes catastrophic faults. Good security also requires limiting physical access to those who will know how to be careful in and around critical systems.

References:
(1) TechSETS Technology Infrastructure Development Guide, Al Foytek.
(2) Software Reliability, J. Musa, A. Iannino and K. Okumoto.


