The incident list and deep dive workflows have been redesigned to make them more efficient, more intuitive and richer in functionality. The incident list pane now shows a single list (instead of separate lists for today, yesterday, liked, muted, etc.).
- You can filter the list by attributes such as owner, state (muted vs open).
- You can quickly navigate to a date of interest using the up/down arrow keys.
- Incident summaries in the list view still show affected hosts and services. In addition, they now show a description field (if a user has added one) and two events: the first one (typically the root cause) and the worst one (e.g. the highest-severity symptom).
- You can collapse an incident summary, mute it (incidents of this type will no longer appear in the open list), or click on "details".
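The choice of the two events shown in each summary can be illustrated with a minimal sketch. The `Event` fields, the numeric severity scale (higher is worse) and the tie-breaking rule are assumptions made for illustration, not a description of how the product selects events.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since epoch (assumed representation)
    severity: int     # higher means more severe (assumed scale)
    text: str


def summarize(events):
    """Return the (first, worst) events of an incident."""
    if not events:
        raise ValueError("an incident needs at least one event")
    # The first event is the earliest one, typically the root cause.
    first = min(events, key=lambda e: e.timestamp)
    # The worst event has the highest severity; ties go to the earliest event.
    worst = max(events, key=lambda e: (e.severity, -e.timestamp))
    return first, worst


events = [
    Event(10.0, 3, "connection pool exhausted"),
    Event(12.5, 7, "request failed: 503"),
    Event(11.0, 5, "retrying upstream call"),
]
first, worst = summarize(events)
```

Here `first` is the "connection pool exhausted" event and `worst` is the "request failed: 503" event.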
The incident details view is now integrated with the log viewer as a collapsible/expandable pane.
- By default the detail view shows you a timeline of the incident events, by host and service. Clicking on any of the dots navigates the Log Viewer to the matching event. The first event is designated by the green dot/highlight, and the “worst” event by a purple one.
- If the incident type has repeated, this is indicated by the dots in the “occurrences” section, where each cycle of the sine wave represents one day. Clicking on these dots navigates to the matching incident occurrence.
- The collapse button will collapse the details pane.
- The back button will take you back to the incident list view (it remembers your earlier location in the list).
- Clicking the “nearby” button will show you an expanded view – adding other anomalies and errors near the incident events.
It is now possible to add additional information to the incident (which will show up in the list view): you can assign it to another user, add a description, and add a link to a Jira ticket.
Finally, the equivalents of the "like" and "mute" feedback are available in the "Alert/Mute" section. Clicking "mute" behaves as before: future occurrences of this type of incident will no longer show up in incident lists (or as Slack alerts). The "like" feedback button has been replaced by the "alert" checkbox for clarity. Checking this will ensure future occurrences of this type of incident create alerts in Slack (and other mechanisms such as webhooks), as well as showing up in the list view.
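The effect of the two settings on future occurrences can be sketched as a small routing function. This is only an illustration: the names `IncidentTypeFeedback` and `route` are hypothetical, and two behaviours are assumed rather than taken from the notes above, namely that "mute" takes precedence when both are set, and that an incident type with neither setting appears in the list without alerting.

```python
from dataclasses import dataclass


@dataclass
class IncidentTypeFeedback:
    muted: bool = False
    alert: bool = False


def route(feedback):
    """Return where a new occurrence of this incident type is surfaced."""
    if feedback.muted:
        # Muted types no longer show up in the open list or as alerts.
        # Assumption: mute wins if both flags are set.
        return set()
    destinations = {"list"}
    if feedback.alert:
        # The alert checkbox adds Slack and other mechanisms such as webhooks.
        destinations |= {"slack", "webhook"}
    return destinations
```

For example, `route(IncidentTypeFeedback(alert=True))` includes `"list"`, `"slack"` and `"webhook"`, while `route(IncidentTypeFeedback(muted=True))` is empty.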
- The highlight context menu is accessible by right-clicking in the Log Viewer.
- Removed Signatures - replaced with View/Alerts.
- Always display the Grafana tab to encourage users to send Metrics for enhanced Incident Detection and use dashboarding facilities.
- Moved all Incident Settings to a top-level dialog box in the Incident List page.
- Consolidated Slack integrations into a single tab under Settings.
Automatically catch application incidents and see root cause using your Elastic Stack. No manual training, no manual alert rules and no changes to your end points.
Auto-detected application incidents are displayed in an elegant Kibana dashboard. With just a click, you can view the set of correlated log events that describe the root cause. Thumbs up and down buttons let you provide feedback on incident quality and customize your incident feed.
- Configure an additional output plugin in your Logstash instance to send log events and metrics to Zebrium.
- Zebrium’s Autonomous Incident Detection and Root Cause will send incident details back to Logstash via a webhook input plugin.
- The incident summary and drill-down into the incident events in Elasticsearch are available directly from the Zebrium ML-Detected Incidents canvas in Kibana.
- For advanced drilldown and troubleshooting workflows, simply click on the Zebrium link in the Incident canvas.
- ZELK Stack integrations require the ELK stack including Logstash.
- Secure endpoint for the Zebrium outgoing webhook to send Incident details to Logstash/Kibana.
- Uses the Logstash HTTP Input Plugin with SSL and Authentication enabled.
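To make the secured webhook path concrete, here is a minimal sketch of what an authenticated HTTPS POST to a Logstash HTTP input could look like from the sender's side. The URL, credentials and payload fields are placeholders, and the use of HTTP Basic authentication is an assumption; the actual payload format and authentication scheme of the Zebrium webhook are not described here.

```python
import base64
import json
import urllib.request


def build_incident_request(url, user, password, incident):
    """Build (but do not send) an authenticated POST carrying incident details."""
    if not url.startswith("https://"):
        # The endpoint is expected to have SSL enabled.
        raise ValueError("endpoint must use HTTPS")
    # Assumption: HTTP Basic authentication is used for the endpoint.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps(incident).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder endpoint, credentials and fields, for illustration only.
req = build_incident_request(
    "https://logstash.example.com:8080/",
    "zebrium",
    "s3cret",
    {"incident_id": "demo-1", "summary": "example incident"},
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```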
We have introduced new features for User Management and Role Based Access Controls whereby you can create groups, assign roles to users, and assign users to groups.
By default, the access and roles that your users have today will not change, so there is nothing you need to do unless you want to use the new controls. This means that all users will be assigned the least restrictive role, Owner.
- Groups: Groups define which deployments are available to Users in the Group.
- Roles: Pre-defined roles (Owner, Admin, Member) which define permissions (e.g. Create, Read (view), Update, Delete) for each feature or application setting.
- Users: Each user is assigned a Role (permissions on features/settings) and Users are members of one or more Groups to control which deployments they can access.
Click here for Detailed information on User Management/RBAC
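The relationship between groups, roles and users can be sketched as a small data model. The role names come from the list above, but the permission sets attached to each role are hypothetical, and permissions are collapsed to one set per role for brevity, whereas the product defines them per feature or setting.

```python
from dataclasses import dataclass, field

# Hypothetical mapping: the actual permissions of each role are not listed here.
ROLE_PERMISSIONS = {
    "Owner": {"create", "read", "update", "delete"},
    "Admin": {"create", "read", "update"},
    "Member": {"read"},
}


@dataclass
class Group:
    name: str
    deployments: set = field(default_factory=set)


@dataclass
class User:
    name: str
    role: str
    groups: list = field(default_factory=list)

    def can(self, action, deployment):
        """True if a group grants access to the deployment and the role allows the action."""
        in_scope = any(deployment in g.deployments for g in self.groups)
        return in_scope and action in ROLE_PERMISSIONS[self.role]


prod_team = Group("prod-team", {"prod"})
alice = User("alice", "Member", [prod_team])
```

With this model, `alice.can("read", "prod")` is true, while `alice.can("delete", "prod")` and `alice.can("read", "staging")` are both false: the first is blocked by her role, the second by her group membership.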
- You can now right-click on the hallmark event in the Incident List page to expose a context menu that will allow you to:
  - Search Google or Stack Overflow
  - Provide feedback to our ML by selecting Like, Mute, Spam
  - Copy the event text
- We’ve added a tutorials page with 10 (and growing) short videos that explain how to get the most from the Zebrium UI. Please check it out and send us feedback and suggestions for more videos!
- One of Zebrium’s innate features is to automatically learn the “dictionary” of unique event types of an application stack, including the event structure and any variables embedded in the log events. The ML will type any variables (as float, int, string, IP address etc.), and even try to name the variables as best as possible from the event structure. The full list of etypes is visible in the filter bar, to allow accurate drill down or precise alerts. It is also possible to build heatmaps based on embedded variables in the etypes.
In some cases, users might prefer to override the ML and define how specific fields are parsed and named, for example for analytics purposes. This is now possible from the “custom etype” menu under settings.
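As a rough illustration of what typing a variable means, the sketch below classifies a single token from a log event as an int, float, IP address or string. It is a simple rule-based stand-in written for this example and does not reflect how the Zebrium ML actually infers types; the function name `infer_type` is hypothetical.

```python
import ipaddress


def infer_type(token):
    """Classify one log token as 'int', 'float', 'ip' or 'string'."""
    try:
        int(token)
        return "int"
    except ValueError:
        pass
    try:
        float(token)
        return "float"
    except ValueError:
        pass
    try:
        # Accepts both IPv4 and IPv6 literals.
        ipaddress.ip_address(token)
        return "ip"
    except ValueError:
        return "string"


typed = {t: infer_type(t) for t in ["42", "3.14", "10.0.0.1", "timeout"]}
```

A custom etype, in these terms, would replace the inferred type or name of a field with one chosen by the user.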
- Scalability and memory usage of the Prometheus collectors have been significantly improved, particularly for larger clusters (>250 nodes).