This update contains multiple enhancements relating to outbound alerts, inbound alert signals, and the ability to enhance manually created alert rules with anomalies detected by our machine learning.
Outbound alerts settings have been re-designed to support multiple outbound alert channels – via Slack, email, ZELK (Elastic stack) or webhooks. Each channel can be given a custom name.
These named alert channels can now be assigned to specific ML detected incident types, or human defined alert rules. For instance, when Zebrium ML detects a new incident type that is of interest to a specific audience, future occurrences of that type of incident can be routed to the corresponding alert channel, with a chosen severity (P1-P5).
If you do not choose a specific alert channel, future occurrences will not be directed to any alert destination. They will still be viewable in the Zebrium incident list, using the “view all occurrences” filter setting. Note: the latter setting is functionally equivalent to clicking the “mute” button in earlier releases, or the “mute” button on the list view.
Note: you need to click the “update incident” button for the new settings to be saved.
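The routing behavior described above can be pictured with a minimal sketch. All class and field names here (`AlertChannel`, `Router`, `assign`, `route`) are hypothetical and are not part of the Zebrium API; the sketch only illustrates that an incident type with no assigned channel is not sent anywhere.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical model for illustration only; not the Zebrium API.
VALID_SEVERITIES = {"P1", "P2", "P3", "P4", "P5"}


@dataclass(frozen=True)
class AlertChannel:
    name: str  # custom channel name chosen by the user
    kind: str  # "slack", "email", "zelk" or "webhook"


@dataclass(frozen=True)
class Route:
    channel: AlertChannel
    severity: str  # one of P1..P5


class Router:
    """Maps incident types (or rule names) to a named channel and severity."""

    def __init__(self) -> None:
        self._routes: Dict[str, Route] = {}

    def assign(self, incident_type: str, channel: AlertChannel, severity: str) -> None:
        if severity not in VALID_SEVERITIES:
            raise ValueError(f"unknown severity: {severity}")
        self._routes[incident_type] = Route(channel, severity)

    def route(self, incident_type: str) -> Optional[Route]:
        # None means no channel was chosen, so no outbound alert is sent.
        return self._routes.get(incident_type)
```

In this sketch, a lookup that returns `None` corresponds to an incident type that remains visible in the incident list but produces no outbound alert.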
In earlier versions, only incidents detected by machine learning were viewable in the incident list in the Zebrium UI. With this update, manually defined alert rules are also reflected in the incident list, with the option to filter them in or out as desired. Manual rules can be filtered using the “Rule Based” filter setting, while ML detected incidents are controlled by the “Auto Detected” filter setting.
The Incident Tab now offers richer filtering capability: filter by first occurrence only (vs. all), by open / muted incidents, by assigned user, by service type (logtype) and by host. See the green Filter button in the picture below:
One of the most powerful capabilities in this update is the ability to augment a user defined rule with ML. For instance, let’s say you want to be alerted anytime this timeout field in the logs has a value > 100 ms (see below). Now, you can also use Zebrium’s ML to provide possible root cause indicators as to why this happened. Simply create a filter, and in the alert rule settings, check the option to augment this alert rule with Zebrium ML.
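As a rough illustration of the kind of static rule being augmented, the sketch below checks whether a timeout value in a log line exceeds 100 ms. The log format (`timeout=<number>ms`) and the function names are assumptions made for this example; in Zebrium the rule is created from a filter in the UI, not written as code.

```python
import re
from typing import Optional

# Assumed log format for illustration: "... timeout=<number>ms ..."
TIMEOUT_RE = re.compile(r"timeout=(\d+(?:\.\d+)?)\s*ms")
THRESHOLD_MS = 100.0


def timeout_ms(line: str) -> Optional[float]:
    """Extract the timeout value in milliseconds, or None if the field is absent."""
    match = TIMEOUT_RE.search(line)
    return float(match.group(1)) if match else None


def rule_fires(line: str, threshold_ms: float = THRESHOLD_MS) -> bool:
    """Return True only when the timeout field is present and above the threshold."""
    value = timeout_ms(line)
    return value is not None and value > threshold_ms
```

A rule like this only says that the threshold was crossed; the ML augmentation is what adds candidate root cause indicators around the matching event.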
The ability to augment static rules with ML can be extended to inbound signal alerts from 3rd party tools. Let’s say your APM or ITSM tool generates an incident for a particular service within PagerDuty, Slack or Opsgenie. You can configure any of these as inputs to Zebrium’s ML, and the inbound signal will trigger a machine learning defined incident that includes anomalous events and metrics that best describe the incident.
The incident list and deep dive workflows have been re-designed to make them more efficient, intuitive and richer in functionality. The incident list pane has been redesigned to show one list (instead of separate lists for today, yesterday, liked, muted, etc.).
- You can filter the list by attributes such as owner and state (muted vs. open).
- You can quickly navigate to a date of interest using the up/down arrow keys.
- Incident summaries on the list view still show affected hosts and services. In addition, they now show a description field (if a user has added one), and 2 events: the first one (typically root cause), and the worst one (e.g. highest severity symptom).
- You can collapse an incident summary, mute it (which will no longer show incidents of this type in the open list), or click on "details".
The incident details view is now integrated with the log viewer as a collapsible/expandable pane.
- By default the detail view shows you a timeline of the incident events, by host and service. Clicking on any of the dots navigates the Log Viewer to the matching event. The first event is designated by the green dot/highlight, and the “worst” event by a purple one.
- If the incident type has repeated, this is indicated by the dots in the “occurrences” section, where each cycle of the sine wave represents one day. Clicking on these dots navigates to the matching incident occurrence.
- The collapse button will collapse the details pane.
- The back button will take you back to the incident list view (it remembers your earlier location in the list).
- Clicking the “nearby” button will show you an expanded view – adding other anomalies and errors near the incident events.
It is now possible to add additional information to the incident (which will show up in list view): assign it to another user, add a description, and add a link to a Jira ticket.
Finally, the equivalents of the "like" and "mute" feedback are available in the "Alert/Mute" section. Clicking "mute" behaves as before: future occurrences of this type of incident will no longer show up in incident lists (or as Slack alerts). The "like" feedback button has been replaced by the "alert" checkbox for clarity. Checking this will ensure future occurrences of this type of incident create alerts in Slack (and other mechanisms such as webhooks), as well as showing up in the list view.
- Highlight context menu is accessible by right clicking in Log Viewer.
- Removed Signatures - replaced with View/Alerts.
- Always display the Grafana tab to encourage users to send Metrics for enhanced Incident Detection and use dashboarding facilities.
- Moved all Incident Settings to a top-level dialog box in the Incident List page.
- Consolidated Slack integrations into a single tab under Settings.
Automatically catch application incidents and see root cause using your Elastic Stack. No manual training, no manual alert rules and no changes to your end points.
Auto-detected application incidents are displayed in an elegant Kibana dashboard. With just a click, you can view the set of correlated log events that describe the root cause. Thumbs up and down buttons let you provide feedback on incident quality and customize your incident feed.
- Configure an additional output plugin in your Logstash instance to send log events and metrics to Zebrium.
- Zebrium’s Autonomous Incident Detection and Root Cause will send incident details back to Logstash via a webhook input plugin.
- The Incident summary and drill down into the Incident events in Elasticsearch are available directly from the Zebrium ML-Detected Incidents canvas in Kibana.
- For advanced drilldown and troubleshooting workflows, simply click on the Zebrium link in the Incident canvas.
- ZELK Stack integrations require the ELK stack including Logstash.
- A secure end-point is required for the Zebrium outgoing webhook to send Incident details to Logstash/Kibana.
- Uses the Logstash HTTP Input Plugin with SSL and Authentication enabled.
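To make the secured webhook path more concrete, here is a minimal sketch of an authenticated HTTPS POST of the kind a Logstash HTTP input with SSL and basic authentication would accept. The URL, credentials and payload fields are all hypothetical; the actual payload that Zebrium sends is not described here, and the request is only constructed, not sent.

```python
import base64
import json
import urllib.request


def build_incident_request(url: str, user: str, password: str, incident: dict) -> urllib.request.Request:
    """Build (but do not send) an HTTPS POST carrying a JSON incident body."""
    if not url.startswith("https://"):
        raise ValueError("the endpoint is expected to use SSL (https://)")
    # HTTP basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps(incident).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical endpoint and payload fields, for illustration only.
request = build_incident_request(
    "https://logstash.example.com:8443/",
    "zebrium",
    "change-me",
    {"incident_id": "example-1", "severity": "P2"},
)
```

Sending the request (for example with `urllib.request.urlopen(request)`) is left out on purpose, since it depends on a reachable endpoint and a trusted certificate.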
We have introduced new features for User Management and Role Based Access Controls whereby you can create groups, assign roles to users, and assign users to groups.
By default, existing access and roles are unchanged, so no action is needed unless you want to make changes. This means that all users will be assigned the least restrictive role, Owner.
- Groups: Groups define which deployments are available to Users in the Group.
- Roles: Pre-defined roles (Owner, Admin, Member) which define permissions (e.g. Create, Read (view), Update, Delete) for each feature or application setting.
- Users: Each user is assigned a Role (permissions on features/settings) and Users are members of one or more Groups to control which deployments they can access.
Click here for Detailed information on User Management/RBAC
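The relationship between Groups, Roles and Users can be sketched as follows. The permission sets assigned to each role below are assumptions made for illustration (the real permissions are defined per feature or setting and are not listed here), and the `can` helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set

# Assumed permission sets, for illustration only. In the product, permissions
# are defined per feature or application setting.
ROLE_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "Owner": frozenset({"create", "read", "update", "delete"}),
    "Admin": frozenset({"create", "read", "update"}),
    "Member": frozenset({"read"}),
}


@dataclass
class Group:
    name: str
    deployments: Set[str] = field(default_factory=set)  # deployments visible to members


@dataclass
class User:
    name: str
    role: str  # "Owner", "Admin" or "Member"
    groups: List[Group] = field(default_factory=list)


def can(user: User, action: str, deployment: str) -> bool:
    """The role decides what a user may do; group membership decides where."""
    allowed = action in ROLE_PERMISSIONS.get(user.role, frozenset())
    in_scope = any(deployment in group.deployments for group in user.groups)
    return allowed and in_scope
```

The point of the two-part check is that a role alone never grants access to a deployment: the user must also belong to a group that includes it.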
- You can now right-click on the hallmark event in the Incident List page to expose a context menu that will allow you to:
  - Search Google or Stack Overflow
  - Provide feedback to our ML by selecting Like, Mute, Spam
  - Copy the event text
- We’ve added a tutorials page with 10 (and growing) short videos that explain how to get the most from the Zebrium UI. Please check it out and send us feedback and suggestions for more videos!
- One of Zebrium’s innate features is to automatically learn the “dictionary” of unique event types of an application stack, including the event structure and any variables embedded in the log events. The ML will type any variables (as float, int, string, IP address etc.), and even try to name the variables as best as possible from the event structure. The full list of etypes is visible in the filter bar, to allow accurate drill down or precise alerts. It is also possible to build heatmaps based on embedded variables in the etypes.
In some cases, users might prefer to over-ride the ML and define how specific fields are parsed and named, say for analytics purposes. This is now possible from the “custom etype” menu under settings.
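To give a feel for what "typing" a variable means, the sketch below classifies a single token as an IP address, int, float or string using only the Python standard library. This is a deliberately naive illustration and not Zebrium's ML; the function name and the order of checks are assumptions.

```python
import ipaddress


def infer_type(token: str) -> str:
    """Guess a type label for one variable token taken from a log event."""
    # Check IP addresses first, since "10.0.0.1" is neither an int nor a float.
    try:
        ipaddress.ip_address(token)
        return "ip"
    except ValueError:
        pass
    # Integers are checked before floats because every integer literal
    # would also parse as a float.
    try:
        int(token)
        return "int"
    except ValueError:
        pass
    try:
        float(token)
        return "float"
    except ValueError:
        return "string"
```

A custom etype plays the role of a manual override for cases where an automatic guess like this is not what the user wants, for example when a numeric-looking identifier should be treated as a string.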
- Scalability and memory usage of the Prometheus collectors have been significantly enhanced, particularly for larger clusters (>250 nodes).