Alert Correlation | Discussions

  • 1.  Alert Correlation

    Posted 09-20-2017 11:28
    What is Alert Correlation?

    It is the process of grouping similar alerts together so that an operator can focus on a single "phenomenon" rather than on individual alerts. For example, a storage failure could cause a cascade of other failures on the virtualization layer, the operating system layer, and the application layer. The alerts from each of these layers should be correlated into a single group so that the full picture of the failure can be ascertained quickly. More importantly, this makes it possible to create just one incident for a given phenomenon and to engage the right people. After all, it is all too common for various teams (network, infrastructure, application, etc.) to waste cycles looking into their own incidents, only to realize that the root cause was somewhere else. With Alert Correlation, the correct source of a failure is identified as quickly as possible, which is critical to reducing the mean time to resolution (MTTR).
    So how does Alert Correlation work? Here are the four underlying mechanisms:
    1. Rule-Based Correlation:

    This is the most basic way to create alert groupings. Using Correlation Rules, you manually define primary alerts, secondary alerts, and how alerts should be related to one another. The alerts can be related according to their attributes or based on the relationships between the configuration items (CIs) of the alerts. For example, you could define an alert on a virtual machine (virtualization layer) to be the primary alert, with all alerts on the corresponding operating system layer as secondary alerts.
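    To make the primary/secondary idea concrete, here is a minimal sketch in plain Python. It is not the Event Management API: the `Alert` and `CorrelationRule` classes, the `layer` attribute, and the `runs_on` relationship map are all hypothetical names invented for illustration. It assumes a rule that makes virtualization-layer alerts primary and attaches operating-system alerts whose CI runs on the same VM as secondary.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    alert_id: str
    ci: str      # configuration item the alert was raised on
    layer: str   # e.g. "virtualization" or "os" (hypothetical attribute)


@dataclass
class CorrelationRule:
    primary_layer: str
    secondary_layer: str
    # Hypothetical CI relationship map: secondary CI -> CI it runs on
    runs_on: dict = field(default_factory=dict)

    def correlate(self, alerts):
        """Return {primary alert id: [secondary alert ids]}."""
        primaries = {a.ci: a for a in alerts if a.layer == self.primary_layer}
        groups = {p.alert_id: [] for p in primaries.values()}
        for a in alerts:
            if a.layer != self.secondary_layer:
                continue
            parent_ci = self.runs_on.get(a.ci)
            if parent_ci in primaries:
                groups[primaries[parent_ci].alert_id].append(a.alert_id)
        return groups


rule = CorrelationRule("virtualization", "os", runs_on={"linux-01": "vm-01"})
alerts = [
    Alert("A1", "vm-01", "virtualization"),
    Alert("A2", "linux-01", "os"),   # runs on vm-01, so secondary to A1
    Alert("A3", "linux-02", "os"),   # no known relationship, stays ungrouped
]
groups = rule.correlate(alerts)
print(groups)  # {'A1': ['A2']}
```

    The point of the sketch is only that every relationship has to be spelled out by hand, which is exactly the maintenance burden described next.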
    This functionality is absolutely table stakes and was introduced at the very beginning of the Event Management offering. However, no one likes writing correlation rules manually, so eventually, this should be superseded by other alert grouping mechanisms through Service Analytics, which leverages machine learning to group alerts together. The following three mechanisms are all features of Service Analytics.
    2. Temporal Analysis:

    This is arguably the most magical part of Service Analytics, which applies machine learning to identify grouping patterns based on sliding time windows. When looking at historical alerts, recurrent patterns or multiple occurrences of the same set of alerts are grouped together. The algorithm then creates a grouping pattern so that future alerts matching this grouping pattern will be correlated into the same group.
    For example, if the algorithm identifies a recurring pattern of alerts against a disk, a storage device, a VM, and an application, it creates a grouping pattern that automatically groups future alerts matching this particular composition, so that these alerts end up in a single group. Groups created this way are called automated alert groups. In the Jakarta release, additional capabilities were added to provide Predictive Alerts and Root Cause Analysis of the alerts; we'll explore them further in the future to see how they help prevent an outage before it happens and identify the root cause when an outage does occur.
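    The actual Service Analytics algorithm is not described here, so the following is only a simplified stand-in: it counts how often the same set of alert types shows up inside one time window and keeps the sets that recur. The function names, the `(timestamp, alert_type)` input format, and the thresholds are assumptions, and the windows are opened at the first uncovered alert rather than truly slid across the history.

```python
from collections import Counter


def learn_patterns(history, window_s, min_occurrences=2):
    """history: list of (timestamp_seconds, alert_type), sorted by time.

    Returns the sets of alert types that co-occurred within one window
    at least `min_occurrences` times.
    """
    counts = Counter()
    i = 0
    while i < len(history):
        start = history[i][0]
        types = set()
        # Collect every alert that falls inside the window opened at `start`.
        while i < len(history) and history[i][0] - start < window_s:
            types.add(history[i][1])
            i += 1
        if len(types) > 1:  # a lone alert is not a grouping pattern
            counts[frozenset(types)] += 1
    return {p for p, n in counts.items() if n >= min_occurrences}


def match_pattern(alert_types, patterns):
    """Return a learned pattern fully contained in the incoming types, or None."""
    incoming = set(alert_types)
    for pattern in patterns:
        if pattern <= incoming:
            return pattern
    return None


history = [
    (0, "disk"), (5, "storage"), (12, "vm"), (20, "app"),
    (3600, "disk"), (3604, "storage"), (3610, "vm"), (3625, "app"),
    (7200, "network"),
]
patterns = learn_patterns(history, window_s=60)
print(match_pattern(["app", "vm", "disk", "storage"], patterns) is not None)  # True
print(match_pattern(["disk", "app"], patterns))                               # None
```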
    3. Topological Analysis:

    With topological analysis, Service Analytics takes advantage of the CMDB relationships defined by the hosting rules and containment rules in the Metadata Editor, which are part of the CMDB identification mechanism. Out of the box, we provide a number of definitions for how CIs are related to one another. For example, a web application needs to run on a server. This means that an alert on the web application and an alert on the server that runs it should be grouped together, because they match the defined hosting rule.
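    As a rough illustration of the web application and server example, the sketch below groups alerts whose CIs resolve to the same host. The `hosted_on` dictionary is a hypothetical stand-in for hosting rules; it is not how the CMDB stores them, and real topology handling would cover containment and other relationship types as well.

```python
def group_by_topology(alerts, hosted_on):
    """alerts: list of (alert_id, ci). hosted_on: child CI -> host CI.

    Groups alerts whose CIs share the same root host.
    """
    def root(ci):
        seen = set()  # guards against accidental cycles in the map
        while ci in hosted_on and ci not in seen:
            seen.add(ci)
            ci = hosted_on[ci]
        return ci

    groups = {}
    for alert_id, ci in alerts:
        groups.setdefault(root(ci), []).append(alert_id)
    return groups


hosted_on = {"web-app": "server-01"}  # hypothetical hosting relationship
alerts = [("A1", "web-app"), ("A2", "server-01"), ("A3", "server-02")]
topo_groups = group_by_topology(alerts, hosted_on)
print(topo_groups)  # {'server-01': ['A1', 'A2'], 'server-02': ['A3']}
```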
    4. Semi-supervised Learning:

    Finally, Service Analytics also allows you to manually group alerts together, and it learns from this to group similar alerts in the future. For example, if you group a set of alerts for network, database, server, and application, then the next time this pattern is detected, semi-supervised learning recognizes that these alerts were grouped together in the past and automatically places them into an alert group.
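    The feedback loop can be sketched as follows. This is a deliberately naive stand-in that remembers the exact set of categories an operator grouped and reapplies it; the real feature presumably generalizes beyond exact matches. The `GroupMemory` class and the `(alert_id, category)` tuples are hypothetical.

```python
class GroupMemory:
    """Remembers manually created groups and reapplies them to new alerts."""

    def __init__(self):
        self._signatures = set()

    def record_manual_group(self, alert_categories):
        # Store the combination of categories the operator grouped by hand.
        self._signatures.add(frozenset(alert_categories))

    def suggest_group(self, incoming):
        """incoming: list of (alert_id, category). Returns alert ids to group."""
        by_category = {}
        for alert_id, category in incoming:
            by_category.setdefault(category, []).append(alert_id)
        for signature in self._signatures:
            if signature <= set(by_category):
                return sorted(a for cat in signature for a in by_category[cat])
        return []


memory = GroupMemory()
memory.record_manual_group(["network", "database", "server", "application"])

incoming = [
    ("A1", "network"), ("A2", "database"), ("A3", "server"),
    ("A4", "application"), ("A5", "storage"),
]
suggested = memory.suggest_group(incoming)
print(suggested)  # ['A1', 'A2', 'A3', 'A4']
```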