Network equipment failure prediction

Network equipment maintenance activities represent one of the biggest expense factors for any Telecommunication Service Provider. Operators continuously try to optimize these activities in order to achieve significant OPEX reductions without impacting service quality. One of the major cost components of maintenance is the rectification work needed when equipment fails.

This article explains how predictive analytics models can be used to predict network equipment failures and to reduce the rectification activities associated with them.

Introduction

Network equipment reliability is one of the major critical issues for telecommunication operators, not only because it may affect the company's outlook – in terms of customer service and reputation – but also because repair activities can be extremely expensive.

Analytics and big data can be used to optimize network maintenance operations. Specifically, the use of predictive analysis to predict network failures allows telecom operators to: 

  • Achieve better-planned field technician activities and increase prevention, thus reducing repairs
  • Increase technicians' productivity (reducing the average number of interventions per technician, repair time, commuting time, etc.) and optimize technical resource and skill pools
  • Improve spare parts tracking, reduce unnecessary card replacements and optimize the spare parts warehouse

This article focuses on reducing corrective maintenance without impacting service levels.

Current situation: how operators manage network equipment maintenance operations without predictive models

The aim of maintenance operations is to achieve a high level of network reliability, that is, to minimize the network equipment downtime.

Network maintenance activities are grouped in two main categories:

  1. Preventive Maintenance (PM): standard, planned and proactive equipment inspection, needed to prevent breakdowns and failures and to enhance equipment reliability
  2. Corrective or Breakdown Maintenance (CM): on-demand, reactive repair of broken-down equipment, to bring it back to its original operating condition

Maintenance costs are the combination of Preventive Maintenance and Corrective Maintenance. The following figure shows how these costs change with the level of maintenance.

Figure 1 – Maintenance costs versus Level of Maintenance (Note: the costs shown in the picture are for illustration only)

Increasing the level of preventive maintenance produces cost benefits, as it reduces end-of-life component failures. Unfortunately, PM cannot eliminate all failures. Beyond a certain point, increasing PM no longer produces any cost benefit. This is the so-called CM/PM optimization zone, where most Service Providers are, or should be.
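The trade-off behind figure 1 can be sketched numerically. This is a minimal, hypothetical model (the cost figures and the cost functions are assumptions, not from the article): PM cost grows with the PM level while CM cost shrinks, so the total has a minimum, which is the optimization zone.

```python
# Illustrative sketch of figure 1 (hypothetical cost figures): PM cost grows
# roughly linearly with the PM level, CM cost falls as more failures are
# prevented, so total maintenance cost has a minimum.

def total_cost(pm_level, pm_unit_cost=10.0, base_cm_cost=500.0):
    """Total maintenance cost at a given PM level (0..100, arbitrary units)."""
    pm_cost = pm_unit_cost * pm_level
    cm_cost = base_cm_cost / (1.0 + pm_level)  # CM cost shrinks as PM grows
    return pm_cost + cm_cost

# Scan PM levels and find the cheapest one: the CM/PM optimization zone.
levels = range(0, 101)
best = min(levels, key=total_cost)
print(best, round(total_cost(best), 1))
```

Beyond `best`, every extra unit of PM costs more than the corrective work it avoids, which is exactly why "more PM" stops paying off.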

To improve the CM/PM ratio, Service Providers must develop new techniques capable of "anticipating" equipment failures: seeing the failures before they happen. Predictive analytics can be the solution.

Network equipment failure prediction model

Regression, survival analysis and other analytics models make it possible to "anticipate" faults, since they can predict the probability of network equipment (or component) failure over a certain period. Using this information, Service Providers can plan a preventive visit before the fault occurs.

We are inclined to think that a predictive model with 85% accuracy (accuracy = the ratio of correct predictions to all predictions) will save us 85% of the corrective visits. That is not true. Predictive analytics is a powerful tool, but we must be careful about the expected results.

To understand the expected savings from using a predictive model, we should understand how statistical predictions work. A fault-prediction analytics model is a function which assigns equipment to a "Fault" or a "No-Fault" class based on predefined rules.

In the real world, a piece of equipment is either broken or working, so it belongs to one of the classes with certainty (100% probability = equipment is broken, 0% probability = equipment is working). Therefore, the two classes have only two possible probability values: 0% or 100%, as shown in picture 2a.

With a predictive model, instead, we generate two probability distribution diagrams (figure 2b) representing the probability that a piece of equipment belongs to the Fault class (red) or to the No-Fault class (blue).

The position, amplitude and shape of the curves depend on the reliability of the prediction parameters, but the two distributions always show an overlapping area. This area indicates situations where both Fault and No-Fault are possible. For example, in figure 2b, at probability = 0.55 we have 67 pieces of equipment in failure but also 20 No-Fault cases. In the overlap area, the model cannot distinguish a Fault from a No-Fault situation.
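The overlap can be reproduced with a small simulation. The distributions below are hypothetical stand-ins for figure 2b (the means, spreads and sample counts are assumptions, not the article's data): faulty equipment tends to score high, healthy equipment low, yet both classes appear in the middle band.

```python
import random

# Hypothetical simulation of figure 2b: model scores for faulty equipment
# cluster high, scores for healthy equipment cluster low, but the two
# distributions overlap, so a score in the middle can come from either class.
random.seed(0)
fault_scores = [min(1.0, max(0.0, random.gauss(0.65, 0.15))) for _ in range(1000)]
ok_scores = [min(1.0, max(0.0, random.gauss(0.35, 0.15))) for _ in range(1000)]

def count_near(scores, p, width=0.05):
    """Number of scores falling in the band p +/- width."""
    return sum(1 for s in scores if abs(s - p) <= width)

# Around score 0.55 both classes are present: looking only at the score,
# the model cannot tell Fault from No-Fault there.
print(count_near(fault_scores, 0.55), count_near(ok_scores, 0.55))
```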

To distinguish a Fault from a No-Fault situation, a reference probability value is chosen during model construction (the green line in picture 2b). This value is called the cutpoint, or threshold. A result above the cutpoint is considered a Fault; below it, a No-Fault. The threshold tells the model what is a fault and what is not. Picture 2c shows how the predictive model works.

The introduction of the threshold solves the issue of the overlap area, but introduces prediction errors. Indeed, ALL values above the threshold are classified as Fault, even though we know that some of them are No-Fault cases (the tail of the blue distribution above the cutpoint). Likewise, among the results below the cutpoint (the No-Fault zone) there may be some Fault cases (the tail of the red distribution).
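The cutpoint rule itself is trivially simple, which is why all the subtlety ends up in choosing the threshold value. A minimal sketch (the 0.5 threshold is illustrative, not a recommendation):

```python
# Minimal sketch of the cutpoint rule: a score at or above the threshold is
# labelled "Fault", anything below it "No-Fault". The threshold value 0.5
# is purely illustrative.

THRESHOLD = 0.5  # the cutpoint chosen during model construction

def classify(score, threshold=THRESHOLD):
    """Turn a predicted fault probability into a hard Fault/No-Fault label."""
    return "Fault" if score >= threshold else "No-Fault"

print(classify(0.8))   # high score  -> "Fault"
print(classify(0.2))   # low score   -> "No-Fault"
```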

Consequently, the model can produce four possible results for each prediction:

  1. True Positive: the system predicts Fault (Positive) AND the prediction is correct (True). In the picture, this is the red area above the threshold
  2. True Negative: the system predicts No-Fault (Negative) AND the prediction is correct (True). In the picture, this is the blue area below the threshold
  3. False Positive: the system predicts Fault (Positive) AND the prediction is incorrect (False). In the picture, this is the blue area above the threshold
  4. False Negative: the system predicts No-Fault (Negative) AND the prediction is incorrect (False). In the picture, this is the red area below the threshold

The result matrix is shown in the following table, where the green cells are the correct predictions and the red cells the wrong ones.
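The four outcomes can be counted directly from predictions and ground truth. This is a generic sketch (the label vectors below are made up for illustration), using 1 for Fault and 0 for No-Fault:

```python
# Sketch of the four prediction outcomes, assuming 1 = Fault, 0 = No-Fault.
# The example labels below are invented for illustration only.

def confusion(actual, predicted):
    """Return (TP, TN, FP, FN) counts for binary fault labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

actual    = [1, 1, 1, 0, 0, 0, 0, 0]   # what really happened
predicted = [1, 1, 0, 0, 0, 0, 1, 0]   # what the model said
print(confusion(actual, predicted))    # -> (2, 4, 1, 1)
```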

When we apply these considerations to the problem of optimizing Service Providers' network maintenance activities, we obtain the following situations:

It is important to observe that prediction models have a side effect on maintenance visits. As a result of the above:

  1. As expected, the model reduces the number of corrective visits: True Positive predictions allow corrective visits to be replaced with preventive/predictive ones.
  2. Unexpectedly, it increases the overall number of visits: False Positive predictions generate additional preventive visits, due to faults that are predicted but will never happen.

Additional preventive visits can jeopardize the cost benefits obtained through True Positive predictions. This means the prediction model must be built in a way that minimizes False Positives.

Analyzing figure 2b, False Positives (the blue area above the threshold) can be minimized by choosing a high threshold value (e.g. above 0.8). But increasing the threshold also reduces the number of True Positives (the red area above the threshold): reducing False Positives reduces True Positives as well. Figure 3 shows a typical example of the relationship between TP and FP.

Figure 3 – True Positive Rate versus False Positive Rate for different possible threshold values of the predictive model

So the requirement of having few False Positives limits the model's ability to identify Faults (True Positives). The model will be able to "predict" only a subset of the faults.
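The trade-off of figure 3 can be sketched with a small threshold sweep over simulated scores (the score distributions below are hypothetical, as in figure 2b): raising the threshold lowers the False Positive rate, but the True Positive rate drops with it.

```python
import random

# Hypothetical sweep reproducing the figure-3 trade-off: a higher threshold
# means fewer false alarms (lower FPR) but also fewer faults caught (lower TPR).
random.seed(1)
fault_scores = [random.gauss(0.65, 0.15) for _ in range(1000)]  # real faults
ok_scores = [random.gauss(0.35, 0.15) for _ in range(1000)]     # real no-faults

def rates(threshold):
    """True Positive rate and False Positive rate at a given cutpoint."""
    tpr = sum(s >= threshold for s in fault_scores) / len(fault_scores)
    fpr = sum(s >= threshold for s in ok_scores) / len(ok_scores)
    return tpr, fpr

for t in (0.3, 0.5, 0.8):
    tpr, fpr = rates(t)
    print(f"threshold={t}: TPR={tpr:.2f} FPR={fpr:.2f}")
```

Each threshold value is one point on the figure-3 curve; the whole sweep traces the curve.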

The False Positive requirement is one of several "constraints" the predictive model must satisfy. Others include:

  • Class imbalance between faults and no-faults (there are many more working units than failed ones)
  • Data quality and availability of “good” predictors
  • Site distribution (e.g. increasing the precision of the model for remote sites)
  •  …

Even though there are several statistical techniques to address these constraints (e.g. cost-sensitive learning), the end result is that the model will be able to predict only a portion of the faults – a percentage that is far from the model's nominal accuracy.
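One simple cost-sensitive idea can be sketched directly: instead of picking the cutpoint that maximizes raw accuracy, pick the one that minimizes the expected visit cost, with a False Positive (useless preventive visit) costing less than a False Negative (missed fault, hence a corrective visit). All numbers below are assumed for illustration.

```python
# Sketch of a cost-sensitive threshold choice. Both the cost ratio and the
# score lists are hypothetical; the point is the selection criterion.

COST_FP = 1.0   # assumed cost of an unnecessary preventive visit
COST_FN = 3.0   # assumed (higher) cost of a missed fault -> corrective visit

def expected_cost(threshold, fault_scores, ok_scores):
    """Total misclassification cost at a given cutpoint."""
    fn = sum(s < threshold for s in fault_scores)   # missed faults
    fp = sum(s >= threshold for s in ok_scores)     # false alarms
    return COST_FP * fp + COST_FN * fn

fault_scores = [0.9, 0.8, 0.7, 0.6, 0.4]            # scores of real faults
ok_scores = [0.1, 0.2, 0.3, 0.5, 0.55, 0.65]        # scores of real no-faults

# Pick the cutpoint with the lowest expected cost rather than the best accuracy.
thresholds = [i / 20 for i in range(1, 20)]
best = min(thresholds, key=lambda t: expected_cost(t, fault_scores, ok_scores))
print(best, expected_cost(best, fault_scores, ok_scores))
```

Changing the COST_FN/COST_FP ratio shifts the chosen cutpoint, which is how the business cost of each error type is folded into the model.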

Prediction-based network maintenance operations

To analyze the benefits of using a predictive model to improve network maintenance activities, we can apply the above-discussed model to a Telecom Operator's real case.

The model is a low-complexity network equipment failure prediction model with the following characteristics:

  • Accuracy = 80-85%
  • Specificity = above 90% (we have privileged FP reduction over TP detection)
  • Sensitivity = 35-40%
  • Prevalence = 8-10% (on average, 8-10 failures per 100 sites during the period)
  • Error rate = 8-15%
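These metrics all follow from the four confusion-matrix counts. The sketch below uses standard definitions; the TP/TN/FP/FN values are illustrative numbers chosen to roughly match the profile above, not the operator's actual data.

```python
# Standard metric definitions computed from the confusion-matrix counts.
# The counts passed at the bottom are illustrative, chosen to roughly match
# the profile listed in the article (high specificity, modest sensitivity).

def metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return {
        "accuracy":    (tp + tn) / total,   # correct predictions / all cases
        "sensitivity": tp / (tp + fn),      # share of real faults caught
        "specificity": tn / (tn + fp),      # share of no-faults correctly cleared
        "prevalence":  (tp + fn) / total,   # share of real faults in the fleet
        "error_rate":  (fp + fn) / total,   # wrong predictions / all cases
    }

print(metrics(tp=3, tn=82, fp=9, fn=6))
```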

For the sake of simplicity, we will apply the model to a group of 100 sites to predict the next period's (4 weeks) activities. In a standard situation, Operators perform:

  • 9 corrective visits
  • 12 preventive visits

Applying the model, we obtain the following values:

The number of corrective visits is reduced from 9 to 6 (33% savings), but we have used 6 preventive visits to correct faults which didn't exist. This represents a loss of efficiency (around 5-10%) in preventive maintenance planning. Overall, the saving achieved is 20-24%.
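The arithmetic of this scenario can be checked back-of-the-envelope style. The figures come from the text above (9 faults, 3 of them caught preventively, 6 false alarms); treating 3 caught faults as the True Positives is an inference from the 9-to-6 reduction.

```python
# Back-of-the-envelope check of the 100-site scenario described in the text.
sites = 100
faults = 9          # corrective visits needed without the model
tp = 3              # faults caught by the model and fixed preventively (9 - 6)
fp = 6              # predicted faults that never happen -> wasted preventive visits

corrective_after = faults - tp
print(f"corrective visits: {faults} -> {corrective_after} "
      f"({tp / faults:.0%} saved), extra preventive visits: {fp}")
```

The saving on corrective visits (33%) is far below the model's nominal 80-85% accuracy, which is exactly the gap the earlier sections explain.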

A Real Case

The predictive model presented in the above scenario is quite simple, but it can be refined to improve its prediction capabilities. The picture below presents a savings scenario over a 3-year period, with a realistic and more complex prediction model.

Figure 4 – Real case of using a predictive model to optimize CM and PM