Alerts

Overview

The AppFirst Alert system lets you create customizable event-driven alerts so you’ll know when your applications and infrastructure are running at less than optimum performance. That way, you can resolve problems before they start to negatively impact your customers. The AppFirst system gives you complete control over when an alert is sent and whom it goes to. This means that you can make sure that the alert is seen by the person or people most suited to resolve the problem.

Configuring Alerts

Immediately upon installing a collector, AppFirst sets up five automatic server alerts for you: memory above 90%, CPU above 75%, disk usage above 90%, collector down, and average response time above 1,000,000 microseconds (1 second). We also set up a Polled Data alert for disk throughput.

Average Response Time – Response time refers to the amount of time the Application Server takes to return the results of a request to the user. The response time is affected by factors such as network bandwidth, the number of users, and the number and type of requests submitted.

To configure alerts, navigate to the Admin – Setup page and click the “Alerts” tab. Once you have alerts configured, this is where they will be listed (there is another page to view Alert History and Alert Status – more on that below). Click “Add a New Alert” on the right side to get started.

Settings

In the “New Alert” popup, the first tab is for your Alert Settings. First, give your alert a name, then select the source for your alert:

  • Process Group
  • StatsD
  • Log
  • Process
  • Server
  • Server Tag

For this example, I’ll be setting up an alert on a Process Group. Once you select the source type, select the target for the alert. In this case, it’s the Apache(pod1) Process Group.

Then select the Trigger Parameter for this alert. The list of Parameters includes any metric that we collect for the selected source. I’ll go ahead and select Average Response Time for my Trigger Parameter. Whenever you change the Trigger Parameter, the graph below it will update to reflect the new Parameter.

Select the Threshold Type you want to use – Single or Out of Band. I’m going to select Single (for more info on Out of Band alerts, see the section on Analytical Alerting below). The Direction dictates whether the alert will trigger on values above or below the line. The Threshold Value reflects the value of the line in the graph. You can set the threshold value either by moving the blue line in the graph or by editing the numerical Threshold Value.

Finally, select the Duration, which is how long the condition has to be true before the alert triggers. For example, you can decide you don’t want to be notified of high response time unless it’s above the threshold for five consecutive minutes.
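The interaction between Threshold Value, Direction, and Duration can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not AppFirst's implementation: the function name `single_threshold_fires`, its parameters, and the one-sample-per-minute interval are all assumptions made for the example.

```python
from typing import Iterable


def single_threshold_fires(samples: Iterable[float], threshold: float,
                           direction: str, duration: int) -> bool:
    """Return True once `duration` consecutive samples breach the threshold.

    Hypothetical helper: assumes one sample per minute, so `duration`
    is a count of consecutive minutes.
    """
    if direction not in ("above", "below"):
        raise ValueError("direction must be 'above' or 'below'")
    if duration < 1:
        raise ValueError("duration must be at least 1")

    streak = 0  # consecutive breaching samples seen so far
    for value in samples:
        if direction == "above":
            breached = value > threshold
        else:
            breached = value < threshold
        # Any sample back inside the threshold resets the streak.
        streak = streak + 1 if breached else 0
        if streak >= duration:
            return True
    return False


# Average response time in microseconds, one sample per minute.
sustained = [1_200_000] * 5
interrupted = [1_200_000] * 4 + [900_000] + [1_200_000] * 4

print(single_threshold_fires(sustained, 1_000_000, "above", 5))    # True
print(single_threshold_fires(interrupted, 1_000_000, "above", 5))  # False
```

The second series never fires because the single sample below the threshold resets the count, which is the point of setting a Duration: brief spikes are ignored.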

Users

Before you can Save an alert, you need to select which users will receive your new alert. You can choose how you’d like to send them alerts as well, whether it’s via email, SMS, or push notification through our mobile apps. Click Save and you’re done. Your alert will be displayed in the Alerts table.

Services

Use this tab to configure your alerts to send to PagerDuty. For more information on that integration, see the PagerDuty integration documentation.

Editing/Deleting/Disabling Alerts

To edit an alert, click on the alert in the Alerts table. The “Edit Alert” popup will display and you can edit the alert just as you did when creating it.

To delete an alert, click the grey check mark. Once the check mark turns green, click “Delete Selected” above.

To disable an alert, click the grey check mark. Once the check mark turns green, click “Disable Selected” above. If you’d like to enable that alert again, follow the same workflow but click “Enable Selected.”

Polled Data Alerts are automatically added when new Polled Data is set up for a collector. Polled Data Alerts cannot be deleted, but can be enabled/disabled in the Admin – Setup – Alerts tab.

Viewing Alert Activity

You can view your Alert Activity by clicking on the Alerts page from the vertical navigation. Your alerts will be displayed in a stacked bar graph and color-coded per their alert status:

  • Ok
  • Warning
  • Critical
  • Resolved

The page will default to activity for the current day, but you can select the date range in the top right of the page. In the above image, I selected “Last 30 Days” for the date range.

The grid below the bar chart shows the most recent alerts for the date range you select. So in the example above, the table is showing alerts only for May 1. If you click on another bar, the table will update with the alerts for that time. In the image below, I clicked on the April 30 bar, which shows alerts that triggered on that day.

You can hover over the trigger to see the complete entry. Clicking the grey “eye” icon next to the search box in the table will display all triggered alerts for the selected date range.

Resolving Alerts

You can resolve your alerts and let everyone on your team know that even though something isn’t normal, it’s OK. Whenever an alert is triggered, you can resolve it either from the email or in the Alerts page. When resolving in the UI, click the “Resolve” button next to the alert you’re resolving. After you click the button, a popup will prompt you to confirm your alert resolution. Click Yes to confirm. Once confirmed, the blue “Resolve” button will turn grey.

Process Alerts

When you create a process alert, it applies only to the particular process ID (PID) running at that time. If you receive an alert that a process died, that alert will never trigger again because it was specific to that PID. What you should do instead is create a Process Group for all of the processes. Then, create a process termination alert on the Process Group, rather than on an individual process. After doing so, whenever any process in the application terminates, you’ll get an alert.

Deterministic Root Cause Alerts

Deterministic Root Cause Alerts tell you not only that there’s a problem, but also:

  • What server is causing the problem
  • What specific process (or processes) on the server is causing the problem
  • Any behavioral changes that have recently been observed on that process

When you receive a Deterministic Root Cause alert, it will tell you all the details about the incident that has occurred and what caused the problem.

Analytical Alerting

In addition to our standard single threshold alerts, Analytical Alerting allows you to set up “Out of Band” alerts. If your systems are constantly changing, what’s “normal” for things like CPU and Memory will also change over time. What you really need is a rolling window that tracks these metrics as they increase or decrease. To make an Out of Band alert, change the alert type in the drop-down to “Out of Band.” When “Out of Band” is selected, additional horizontal lines will be rendered on the graph. A blue line represents the mean, while the orange lines represent the mean plus or minus the standard deviation times the band value. The “Out of Band” alert allows you to set a trigger when the value goes Above, Below, or Outside the orange boundaries.

Trigger = Outside

When you choose “Outside” the band as the trigger, an alert will trigger if the value is above or below the user-defined band of what’s normal. The grey “band” between the orange lines represents what is normal for that metric (the alert will not trigger while the metric falls in the grey area). This band changes over time based on the rolling calculation window of the last X hours or days, which is also user-definable.

Trigger = Above

When you choose “Above” the band as the trigger, an alert will trigger if the value goes above the band (that is, into the white space above the upper orange line).

Trigger = Below

When you choose “Below” the band as the trigger, an alert will trigger if the value goes below the band (that is, into the white space below the lower orange line).
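The three trigger modes differ only in which side of the band counts as a breach. A minimal Python sketch of that decision follows; the function name `out_of_band_fires` and its arguments are hypothetical and chosen for illustration, with `lower` and `upper` standing for the two orange lines.

```python
def out_of_band_fires(value: float, lower: float, upper: float,
                      trigger: str) -> bool:
    """Decide whether a value breaches the band for a given trigger mode.

    Hypothetical helper: `lower` and `upper` are the band boundaries,
    and `trigger` is one of "above", "below", or "outside".
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if trigger == "above":
        return value > upper
    if trigger == "below":
        return value < lower
    if trigger == "outside":
        # Either side of the band counts as a breach.
        return value < lower or value > upper
    raise ValueError("trigger must be 'above', 'below', or 'outside'")


# With a band of 20 to 60, a reading of 75 is above the band.
print(out_of_band_fires(75, 20, 60, "above"))    # True
print(out_of_band_fires(75, 20, 60, "below"))    # False
print(out_of_band_fires(75, 20, 60, "outside"))  # True
print(out_of_band_fires(40, 20, 60, "outside"))  # False
```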

Polled Data Alerts

When a new polled data source begins to stream to our webapp, an alert on the polled data will automatically be created. By default, the alert will trigger off of critical values for the polled data item. The critical value is whatever the polled data script lists as its critical value. You can further modify and rename these alerts in the Admin – Setup – Alerts screen the same way that you would modify any other alert. For more information on Polled Data, please see our Polled Data Documentation.

How we calculate this data

At AppFirst we store the history for every metric, so based on the user-defined window, we calculate the mean and standard deviation over that time period. If you refer to the figures above, we are calculating the mean and standard deviation for CPU for server “frontenddev” over the last 1 hour and multiplying the standard deviation by a factor of 2.19. If, at any minute, we notice that the CPU for this server rises above the upper boundary specified by mean + standard_deviation * band_value, then the alert is triggered. It is important to note that the mean and standard deviation change over time since the window period always references the most recent historical data.

The bell curve above represents what a normal distribution should look like. As you can see, the majority of the metric values (95.4%) will reside within two standard deviations of the mean, but you may adjust this to your liking depending on your data.
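As a rough illustration of the calculation described above, the band boundaries can be derived from a window of recent samples as follows. This is a sketch under stated assumptions, not AppFirst's code: the function name `band_limits` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which variant is used.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def band_limits(window: Sequence[float], band_value: float) -> Tuple[float, float]:
    """Return (lower, upper) = mean -/+ standard_deviation * band_value.

    Hypothetical helper: `window` holds the samples from the rolling
    calculation window, for example the last hour of CPU readings.
    Assumes the population standard deviation.
    """
    if len(window) == 0:
        raise ValueError("window must contain at least one sample")
    center = mean(window)
    spread = pstdev(window) * band_value
    return center - spread, center + spread


# Example window with mean 5 and standard deviation 2.
cpu_window = [2, 4, 4, 4, 5, 5, 7, 9]
lower, upper = band_limits(cpu_window, 2.19)

# An "Above" trigger fires when the newest reading exceeds `upper`.
latest = 9.5
print(latest > upper)
```

Because the window always holds the most recent samples, recomputing the limits each minute makes the band follow the metric as its normal level drifts.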