Dealing with the “bad actors” problem


Abstract

Monitoring and analyzing data for predictive maintenance assumes that the asset's condition is reflected in its load response and that the variance of the load capacity is small. The risk of failure increases as the cumulative load approaches the load capacity, so with appropriate models and load history data, a good estimate of the lifetime of individual instances can be obtained. This concept can therefore be used very effectively for predictive maintenance; the accuracy of the prediction is essentially limited by the variance of the component properties.

If unplanned outages occur prematurely and are concentrated on a small percentage of instances, these assumptions no longer hold, and we are dealing with the "bad actors" problem. In this case, hidden factors are at work that the concepts above do not capture. The primary task is then to determine the causes of this concentration of failure risk.

There are several causes for premature, concentrated failures. Their treatment is explained below.

  1. Incorrect Loading

Incorrect loading can cause damage because units are designed primarily for foreseeable extreme loads of a mechanical, thermal, electrical, or chemical nature. While testing should protect them against the corresponding variety of damaging effects, covering the entire operating range is hardly realistic in practice. The risk is greatest for the auxiliary units of large-scale plants because, depending on the actual operating conditions, they may operate far beyond their design limits. Accordingly, unexpected effects occur, such as oil dilution or sooting induced by cold running, relay erosion or bearing wear due to high switching frequencies, and corrosion caused by unforeseen media or downtime.

Tasks

Damage analyses should always consider the load aspect. This may require a measurement campaign or permanently installed sensors on the critical components. The measurement data obtained this way serves not only to verify the cause but also to identify the critical operating modes of the system. This makes it possible to determine which operating restrictions ensure gentle operation, or how a component should be redesigned to withstand the load.
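As a minimal illustration of how such measurement data can be screened, the Python sketch below computes the fraction of operating time each monitored channel spends outside its design envelope. The channel names and limits are purely illustrative assumptions, not values from a real plant.

```python
import pandas as pd

# Illustrative design envelope; channel names and limits are assumptions.
DESIGN_LIMITS = {
    "oil_temperature_C": (40.0, 110.0),
    "switching_rate_per_h": (0.0, 30.0),
}

def exceedance_report(measurements: pd.DataFrame) -> pd.DataFrame:
    """Fraction of operating time each channel spends outside its design envelope."""
    rows = []
    for channel, (low, high) in DESIGN_LIMITS.items():
        outside = (measurements[channel] < low) | (measurements[channel] > high)
        rows.append({"channel": channel, "fraction_outside": outside.mean()})
    return pd.DataFrame(rows)

# Example with four synthetic samples per channel.
data = pd.DataFrame({
    "oil_temperature_C": [35.0, 60.0, 95.0, 118.0],
    "switching_rate_per_h": [5.0, 42.0, 12.0, 8.0],
})
print(exceedance_report(data))
```

Channels with a high exceedance fraction point directly at the operating modes that warrant either operating restrictions or a redesign.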

  2. Dynamic Operation

Load dynamics can lead to unintended, damaging loads if the load responses of individual components lag behind the loads (e.g., heating and cooling delays). This results in transient deviations from the target values of temperature, pressure, voltage, etc. Because these deviations usually occur locally in highly stressed component zones, they exacerbate, for example, thermally induced stress or wear. Transient cavitation in pumps or cooling systems and the side effects of switching events also belong in this category.

Tasks

In many cases, the system load is simulated. The dynamics of a system, however, often cannot be determined with sufficient accuracy, primarily because the input data for model calibration is missing. They can usually be measured relatively quickly, e.g., in the form of heating curves, inductive switching voltages, etc. Using the results of the calibrated models, the dynamics are then limited so that the damaging effect of transient overloads disappears.
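As an example of such a quick measurement, the sketch below fits a first-order lag model to a heating curve to recover the thermal time constant. The data is synthetic, and the first-order model form is an assumption; real components may require higher-order models.

```python
import numpy as np
from scipy.optimize import curve_fit

def heating_curve(t, t_ambient, t_rise, tau):
    """First-order step response: T(t) = t_ambient + t_rise * (1 - exp(-t / tau))."""
    return t_ambient + t_rise * (1.0 - np.exp(-t / tau))

# Illustrative measurement: time in minutes, temperature in deg C (synthetic).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 120.0, 25)
temp = heating_curve(t, 20.0, 65.0, 18.0) + rng.normal(0.0, 0.5, t.size)

params, _ = curve_fit(heating_curve, t, temp, p0=(20.0, 60.0, 10.0))
print(f"Estimated thermal time constant: {params[2]:.1f} min")
```

A load ramp that is slow relative to the estimated time constant keeps the transient deviation, and hence the local overload, small.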

  3. Scattering of Component Load Capacity

Components survive operation as long as their cumulative load remains smaller than their load capacity. If the load capacity is grossly miscalculated, all components fail: a "teething problem" that requires a design change. This is the worst-case scenario, which we do not consider here.
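The survival condition itself can be made concrete with a short sketch of linear damage accumulation (Palmgren-Miner) against a Basquin-type S-N curve; the load history and S-N parameters below are illustrative assumptions.

```python
import numpy as np

def cumulative_damage(cycle_counts, stress_amplitudes, sn_constant, sn_exponent):
    """Linear (Palmgren-Miner) damage accumulation over a block load history.

    Cycles to failure at stress amplitude S follow a Basquin-type S-N curve:
        N_f(S) = sn_constant * S ** (-sn_exponent)
    Each block contributes n / N_f; failure is predicted when the sum reaches 1.
    """
    n_f = sn_constant * np.power(np.asarray(stress_amplitudes), -sn_exponent)
    return float(np.sum(np.asarray(cycle_counts) / n_f))

# Illustrative load history: cycle counts and stress amplitudes in MPa.
counts = [1e5, 2e4, 5e2]
amplitudes = [80.0, 120.0, 200.0]

damage = cumulative_damage(counts, amplitudes, sn_constant=1e14, sn_exponent=4.0)
print(f"Accumulated damage: {damage:.3f} (failure expected at 1.0)")
```

Scatter in load capacity corresponds to scatter in the S-N parameters: the weakest instances reach a damage sum of 1 far earlier than the fleet average.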

With borderline designs, however, only the poorest-quality components fail prematurely. A defective part is usually replaced with one of higher load capacity, so the problem gradually disappears as operation weeds out the weakest components over time. However, it can take a long time until all "bad actors" are eliminated; until then, availability remains limited and the repair rate remains high.

Tasks

The time course of the failure rate provides a guideline: it must decrease over time as the proportion of poor-quality components shrinks. This is examined with a Weibull analysis of component lifetimes, which then informs the appropriate quality measures. In practice, this approach is often neglected when it does not pay off immediately, for example when component testing is expensive but defective components can be replaced cheaply. Overall, however, high hidden costs can arise from downtime and its consequences.
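A minimal Weibull fit of this kind might look as follows, using scipy. The lifetimes are illustrative, and a real analysis must also account for censored (still-running) units, which this sketch omits.

```python
import numpy as np
from scipy.stats import weibull_min

# Illustrative component lifetimes in operating hours (complete failures only;
# real analyses must handle censored, still-running units explicitly).
lifetimes = np.array([120.0, 340.0, 410.0, 800.0, 1500.0, 2600.0, 4100.0, 9000.0])

# Fit a two-parameter Weibull distribution (location fixed at zero).
shape_beta, _, scale_eta = weibull_min.fit(lifetimes, floc=0)
print(f"shape beta = {shape_beta:.2f}, scale eta = {scale_eta:.0f} h")

# beta < 1 indicates a decreasing failure rate: early failures driven by a
# weak subpopulation ("bad actors") rather than wear-out (beta > 1).
```

A shape parameter well below one confirms the decreasing failure rate expected when a weak subpopulation dominates the early failures.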

  4. Manufacturing Quality of the System

The quality of a system depends not only on the individual components but also on the entire process chain, including assembly and installation. Installing systems that consist of multiple components (railway switches, industrial plants) is challenging due to their dimensions and weight alone. Accordingly, system properties vary with differences in embedding, tolerance chains, and signal connections. Borderline installations repeatedly lead to incorrect loading and excessive damage. Replacing parts does not address these causes: the damaged parts are merely the victims of a causal chain of improper installation, and the new parts are subjected to the same incorrect loading.

Tasks

System monitoring often begins only when a problem occurs or after the warranty period has expired. It is more efficient to monitor systems from the very start of installation, since their initial load responses are excellent indicators of poor installation quality; acceptance tests typically cover only the specific test cases. In addition, the load history of suspected instances should be monitored, as it allows conclusions about the nature of a defective installation's impact (on mechanical, thermal, electrical, or chemical load capacity).
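One simple way to screen initial load responses across a fleet is a robust outlier score, as in the sketch below. The unit IDs, the scalar response feature, and the threshold are assumptions for illustration.

```python
import numpy as np

def commissioning_outliers(initial_responses, z_threshold=3.5):
    """Flag units whose initial load response deviates strongly from the fleet.

    initial_responses maps unit id -> a scalar load-response feature measured
    right after installation (e.g., vibration level at a reference load).
    Uses a median/MAD-based modified z-score, which is robust to the
    outliers it is meant to find.
    """
    values = np.array(list(initial_responses.values()), dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # robust spread estimate
    if mad == 0.0:
        return {}
    return {unit: round(0.6745 * (value - median) / mad, 1)
            for unit, value in initial_responses.items()
            if 0.6745 * abs(value - median) / mad > z_threshold}

# Illustrative fleet: unit S17 stands out right after installation.
fleet = {"S01": 1.1, "S02": 0.9, "S03": 1.0, "S04": 1.2,
         "S05": 0.95, "S06": 1.05, "S07": 1.0, "S17": 3.8}
print(commissioning_outliers(fleet))
```

Units flagged this way are candidates for a closer look at their installation before the warranty period runs out, not after.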

  5. Temporal Stability of a Plant

After a run-in phase, the properties of a plant should stabilize. In practice, however, individual units (e.g., in a wind farm or a vehicle fleet) often change their behavior monotonically or even "stochastically." The causes are correspondingly either monotonic, such as the settlement of a plant's subsoil, or time-varying, such as control problems. Even if such changes do not directly lead to failures or malfunctions, they exacerbate stress and damage, which in any case reduces the service life of the units.

Tasks

Changing conditions or excitations often lie outside the scope of plant monitoring and can then only be detected indirectly, via trends, drifts, and ranges of the measured variables. Global parameters such as efficiency, load profile, and oil temperature should therefore be monitored broadly, even if they cannot be directly attributed to a specific failure risk.
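A drift in such a global parameter can be estimated with something as simple as a least-squares slope, as sketched below on synthetic efficiency data; the sampling interval and drift magnitude are illustrative.

```python
import numpy as np

def drift_per_month(day_index, values):
    """Least-squares slope of a monitored global parameter, scaled to per-month drift."""
    slope_per_day, _ = np.polyfit(day_index, values, deg=1)
    return slope_per_day * 30.0

# Illustrative: daily efficiency readings over half a year with a slow drift.
rng = np.random.default_rng(0)
days = np.arange(180)
efficiency = 0.92 - 1.5e-4 * days + rng.normal(0.0, 0.002, days.size)

print(f"Efficiency drift: {drift_per_month(days, efficiency):+.4f} per month")
```

A drift that is small day to day but steady over months is exactly the kind of change that escapes threshold-based alarms yet shortens service life.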


The effects described above are encountered in practice. They often dominate failure events and sometimes maintenance as well. Accordingly, they should be treated with priority, as outlined above. The business case for comprehensive monitoring and analysis of plant data can still be attractive, but it should be considered only afterwards.
