The Principles of System Reliability

Unlike living things, technical systems do not repair themselves. They break down, and the more complex they are and the more varied their use, the more likely this is to happen.

Everything that exists can break.

That sounds trivial.

If this principle were taken seriously, technical systems would be kept as simple as possible. Analyzing current system architectures, however, reveals that they are anything but simple. Architectures should be lean, simple, and modular in both hardware and software, yet we see a general trend toward integrated functional monsters with corresponding complexity. Moreover, every additional element of a system brings additional failure risks and interactions.

The more complex a system, the harder it becomes to identify, assess, and eliminate these risks. We see this, for example, in the exponential increase in software error frequency in the automotive industry. For complex systems, eliminating all risks is fundamentally impossible. This is likely the source of many of the problems with the autonomous operation of systems and vehicles.

This also includes the rocket problem.

Even when a system is not operated, it can fail, for example, due to corrosion or aging. As long as it is not in operation, it’s difficult to determine whether it will work or not. From the electrical contact on the launch switch to the navigation software, a series of mechanisms must function during rocket launch. The product of even minor unreliabilities in this chain leads to the well-known problems of space travel.
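This chain effect can be made concrete with a short sketch. The component count and per-component reliability below are illustrative assumptions, not real launch data:

```python
# Sketch: a serial chain works only if every link works, so its
# reliability is the product of the component reliabilities.
# (Numbers are illustrative, not real launch data.)
def chain_reliability(component_reliabilities):
    """Probability that every element of a serial chain functions."""
    total = 1.0
    for r in component_reliabilities:
        total *= r
    return total

# Fifty components, each 99.9% reliable, still fail together
# in roughly one launch out of twenty:
print(chain_reliability([0.999] * 50))  # ~0.951
```

Even near-perfect components multiply out to a noticeable system-level failure rate; this is the product of minor unreliabilities described above.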

Sensors are part of the system.

This must be considered in sensor concepts for system monitoring. Much of the system characterization is already available from the control system. Each additional sensor must be justified, primarily by the risks it covers.

Sensors must, in any case, be significantly more reliable than the monitored system. Because this is often not the case, sensor errors contribute significantly to false alarms and erode user acceptance. Redundancy solves this problem only at high cost and with high maintenance effort.
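Why sensor errors come to dominate alarms can be sketched with hypothetical probabilities (the function and all figures below are assumptions for illustration):

```python
# Sketch: fraction of alarms that are false when an imperfect sensor
# watches a mostly healthy system. Probabilities are hypothetical.
def false_alarm_fraction(p_system_fault, p_sensor_error):
    """Share of raised alarms that are false, assuming the sensor
    flags every real fault but also trips on its own errors."""
    true_alarms = p_system_fault
    false_alarms = (1 - p_system_fault) * p_sensor_error
    return false_alarms / (true_alarms + false_alarms)

# System faults are rare (0.1%); sensor errors are ten times
# more common (1%):
print(false_alarm_fraction(p_system_fault=0.001, p_sensor_error=0.01))
```

With these numbers, roughly nine out of ten alarms are false: the sensor, not the system, dominates what the operator sees, which is exactly how user acceptance erodes.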

Everything that can break will break.

Development projects can be viewed as a systematic process of risk elimination. Despite significant effort along the safety chain, some risks always survive. They surface only during representative tests of the entire system, revealing what was previously unknown or misjudged. If this learning impulse arrives in time, that is still the best-case scenario. Otherwise, previously unknown error mechanisms appear for the first time at the customer’s site, in the worst case as serial failures. There are numerous examples of this, too.

Consequences:

  • No facility posing a potential risk should be created without permanent supervision, e.g. no nuclear waste repository without permanent monitoring.
  • Repairability and testability must be built into the system.
  • Modularity should be sought to eliminate weak points.
  • Risk-focused system monitoring should be part of the product architecture.
  • It can be technically and economically sensible to replace parts as a precautionary measure instead of optimizing them for maximum reliability and service life.
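The last point can be sketched as a simple expected-cost comparison; the costs and failure probabilities below are hypothetical:

```python
# Sketch: expected cost of one maintenance interval, comparing
# run-to-failure with precautionary replacement. Costs are hypothetical.
def expected_cost(p_fail, cost_planned, cost_failure):
    """Either the part survives until the planned swap, or it fails
    first at a much higher (downtime-laden) cost."""
    return (1 - p_fail) * cost_planned + p_fail * cost_failure

# Run to failure: the part eventually fails, always at the high cost.
run_to_failure = expected_cost(p_fail=1.0, cost_planned=0.0, cost_failure=5000.0)
# Precautionary swap: a cheap planned replacement, small residual risk.
preventive = expected_cost(p_fail=0.05, cost_planned=200.0, cost_failure=5000.0)
print(run_to_failure, preventive)  # 5000.0 440.0
```

Whenever unplanned failure carries a large downtime penalty, the cheap scheduled swap wins, even though the replaced part is usually still serviceable.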

Reliability is a system property.

System reliability cannot be modularized.

The complexity of systems is usually addressed by breaking them down into subsystems or modules. Each module is designed, developed, manufactured and assembled to meet the specified requirements.

The reliability of the modules can also be proven to a certain extent in this way. Interactions can be dealt with at the appropriate integration levels of a system.

However, breaking down the system creates technical and organizational interfaces. The latter arise at the phase boundaries of the product life cycle. These interfaces act as information sinks and therefore form the focal points of unreliability.

People are part of the system.

When experts operate their system, everything runs “like clockwork”, but only as long as they do not believe they are smarter than the system. Otherwise the Chernobyl disaster would not have happened.

Practically every restriction on a system’s scope of use can be circumvented, because the safety architecture was not created by a superior superintelligence. Yet no malicious intent is even required: ordinary maintenance activities quite often leave the system in a state that deviates from the intended one. This leads to consequential damage or recurring failures, and sometimes to catastrophes, such as the Three Mile Island reactor accident.

Users do something with the system that the designer did not anticipate, or they do it in a way the designer did not anticipate. Then, in the long run, something goes wrong:

  • Luxury cars are left idling in front of luxury hotels so that the air conditioning keeps cooling. This acidifies the oil and ruins the engine.
  • People board the subway mainly at the front or the back. Those doors open far more often and are therefore less reliable than their neighbors.
  • Dirty laundry is piled on the open door of the washing machine, causing it to leak.

Larger, long-lasting technical systems work only with a professional maintenance process. When experts leave a maintenance team, they take their implicit knowledge with them, and problems arise afterwards that were previously unknown.

Consequences:

  • A system is more than the sum of its parts. It must therefore be developed and treated as a whole throughout its life cycle.
  • Organizational interfaces in the life cycle must be linked organizationally.
  • A system must be considered from different user perspectives.
  • Risks from unintended side effects of interventions in the system must be treated in a structured manner.
  • The maintenance process should be treated as a generator of system knowledge.