Predicting SLA Breakdowns

This tutorial describes how Iotellect can help you proactively manage devices, services, and processes. It does so by alerting operators when the statistical trend of a chosen metric indicates that a numeric SLA threshold will be breached in the near future.

Here are some examples of alert configurations:

  • Raise an alert if HDD load on a critical server will reach 95% in two weeks.

  • Raise an alert if the average availability of a multi-component business service will go below 99% in the next three months.

In our example we'll set up an alert that will be triggered if the average CPU load on a server is predicted to go over 80% within a week.

Creating a Statistical Channel

First of all, we need to set up a statistical channel for calculating daily CPU load averages on our server.

To create a statistical channel, select Edit Device Properties from the popup menu of a network device, switch to the Statistical Channels tab (1), and add a new row to the channels table (2).

Set the Variable to point to the variable containing the hrProcessorLoadTable value. Then open the channel Parameters (4) and set up the Expression that returns the numeric values whose SLA will be monitored by the alert.

From the Parameters page, we set the expression {hrProcessorTable} to provide the exact value of the core hours being used by the processor on our target device (1).

It's also a good idea to disable Storage Periods for all archive types except Daily, since we'll use only daily data for analysis. To do this, right-click inside any storage period and select Remove Value from the context menu (2).

Set the length of the Daily archive to 14 to ensure that the prediction is based on statistical data of sufficient length.
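To make the role of the Daily archive more concrete, here is a minimal Python sketch of what a daily-average channel with 14 retained values conceptually does. It is an illustration only, not Iotellect's implementation: the function name daily_averages, the keep_days parameter, and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_averages(samples, keep_days=14):
    """Average (timestamp, value) samples per calendar day.

    Returns the most recent `keep_days` days as a list of
    (date, average) pairs sorted by date. Hypothetical helper,
    not part of Iotellect.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp.date()].append(value)
    averages = sorted((day, sum(vals) / len(vals)) for day, vals in buckets.items())
    return averages[-keep_days:]


# Made-up CPU load samples (percent) for illustration.
samples = [
    (datetime(2024, 1, 1, 0, 0), 40.0),
    (datetime(2024, 1, 1, 12, 0), 50.0),
    (datetime(2024, 1, 2, 0, 0), 60.0),
]
result = daily_averages(samples)
```

Here result holds one averaged value per day (45.0 for the first day, 60.0 for the second). Once more than 14 days of samples exist, only the latest 14 daily values are kept.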

Creating an Alert

Create a new alert that will be raised when a future SLA breach is detected. Set up the alert's notifications (such as mail or SMS sending) and an appropriate warning message.

Add one Variable Trigger to the alert, and set it up as follows:

  • Context Mask: users.admin.scripts.slaBreakdown (SLA Breakdown Date Calculator script)

    • If you can’t find this script, it can be added through the Add Resources option of your Iotellect server.

  • Variable: childInfo (Properties)

  • Expression:

    dateDiff(now(), 
    cell({users.admin.scripts.slaBreakdown:execute("users.admin.devices.criticalServer", "cpuLoad", 4, 0, 80)}),
    "day") < 7
    &&
    dateDiff(now(),
    cell({users.admin.scripts.slaBreakdown:execute("users.admin.devices.criticalServer", "cpuLoad", 4, 0, 80)}),
    "day"
    ) > 0
  • Mode: State (True/False)

  • Delay: 0 ms

The Check Period should not be very short, since the statistics-based SLA breakdown date calculation is a resource-consuming task. A Check Period of 10 minutes or longer should be sufficient.

The core of this alert is its expression. This expression refers to the SLA Breakdown Date Calculator script that is part of the Iotellect distribution. The reference {users.admin.scripts.slaBreakdown:execute("users.admin.devices.criticalServer", "cpuLoad", 4, 0, 80)} calls the execute function from this script's context and passes its parameters to it, among them the device context path ("users.admin.devices.criticalServer") and the SLA threshold (80).

The script returns the date when the trend of daily averages will cross the 80% CPU load baseline. However, the date is returned as a Data Table (since our reference calls a context function, and context functions always return Data Tables), so we need to use the cell() expression language function to extract the actual cell value (i.e. the expected SLA breach date).
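The idea of extrapolating a trend to a baseline can be sketched in Python. This is only an illustration under a stated assumption: it fits a simple least-squares line to the daily averages, which may differ from what the bundled script actually does. The function name predict_breach_date and the sample history are hypothetical.

```python
import math
from datetime import date, timedelta


def predict_breach_date(daily_values, threshold):
    """Fit a straight line to (date, value) pairs and return the date
    on which that line reaches `threshold`.

    Returns None when there is too little data or the trend is not
    rising. The result may lie in the past if the fitted line has
    already crossed the threshold. Hypothetical helper, assuming a
    linear trend.
    """
    if len(daily_values) < 2:
        return None
    origin = daily_values[0][0]
    xs = [(day - origin).days for day, _ in daily_values]
    ys = [value for _, value in daily_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return None  # all samples fall on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope <= 0:
        return None  # load is flat or falling, no upward crossing
    intercept = mean_y - slope * mean_x
    days_to_threshold = (threshold - intercept) / slope
    return origin + timedelta(days=math.ceil(days_to_threshold))


# Made-up history: 14 daily averages rising by 2 points per day from 50%.
history = [(date(2024, 1, 1) + timedelta(days=i), 50.0 + 2.0 * i) for i in range(14)]
print(predict_breach_date(history, 80))  # 2024-01-16
```

With this made-up history the fitted line gains 2 points per day starting from 50%, so it reaches 80% fifteen days after the first sample.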

Finally, we use the dateDiff() expression language function to return true if the SLA breach will occur during the next seven days.
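The two comparisons in the trigger expression amount to a window check: the predicted breach must be more than 0 and fewer than 7 days away. A minimal Python sketch of that logic follows, assuming dateDiff() yields the number of whole days from its first date argument to its second. The function name breach_within_window is hypothetical.

```python
from datetime import datetime, timedelta


def breach_within_window(now, breach_date, window_days=7):
    """Return True when the breach lies more than 0 and fewer than
    `window_days` whole days after `now`. Hypothetical helper that
    mirrors the two comparisons in the alert expression."""
    days_until_breach = (breach_date - now).days
    return 0 < days_until_breach < window_days


now = datetime(2024, 1, 1, 12, 0)
print(breach_within_window(now, now + timedelta(days=3)))   # True
print(breach_within_window(now, now + timedelta(days=10)))  # False
print(breach_within_window(now, now - timedelta(days=2)))   # False
```

The lower bound keeps the alert from firing on a predicted date that is already in the past, and the upper bound limits it to breaches expected within the coming week.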
