What Is Observability? And How To Evolve From Monitoring
Monitoring servers is challenging. And as you add more servers to support your global teams, monitoring becomes increasingly difficult.
Without a way to monitor a large topology, you’re probably spending a lot of time reacting to problems and constantly putting out fires, such as:
- Running out of disk space.
- Automation not working properly.
- Issues with hardware and OS.
- Server capacity concerns (CPU, memory, temperature).
All of these issues can halt development and delay production. But with monitoring, you can see what is happening. You can tell if there is high transaction volume on a single server. And with observability, you can get ahead of a problem before it becomes an issue.
So What Is Monitoring?
Monitoring is a way to look into what your servers are doing in real time. It can provide you with actionable data. It is most often used for troubleshooting and capacity planning. It is typically done at three levels:
- Network (e.g. traffic, bandwidth, latency).
- Machine (e.g. CPU and memory utilization and storage).
- Application (e.g. rate of user commands, locks, large syncs, commits/submits, etc).
What Is the Difference Between Observability and Monitoring?
Although observability might seem like it’s just monitoring with a DevOps facelift, it has evolved to include:
- Alerting/visualization (e.g. monitoring dashboard).
- Tracing infrastructure (e.g. for distributed systems).
- Log aggregation/analytics.
Observability can be used for debugging, complex troubleshooting, and performance analysis. It looks at the big picture. For example:
- How can you predict if a particular server is going to run out of disk space based on the historical patterns, or user behavior?
- How does a remote server behave under a certain condition?
- What happens if operating system or application settings are changed? Do you see a spike or sudden reduction in traffic or performance?
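The first question above, predicting when a server will run out of disk space from historical patterns, can be sketched with a simple linear trend. This is a minimal illustration, not a production forecaster: the function name `days_until_full`, the daily sampling interval, and the sample numbers are all assumptions made for the example.

```python
from typing import Optional, Sequence


def days_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until the disk is full from one usage sample per day.

    Fits a least-squares line through the samples and extrapolates from the
    latest sample to capacity. Returns None if usage is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_gb) / n

    # Least-squares slope: growth in GB per day.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, used_gb))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator

    if slope <= 0:
        return None  # no exhaustion predicted by a linear trend

    remaining_gb = max(capacity_gb - used_gb[-1], 0.0)
    return remaining_gb / slope


# Hypothetical history: usage in GB over five consecutive days.
history = [400.0, 410.0, 420.0, 430.0, 440.0]
print(days_until_full(history, capacity_gb=500.0))
```

A straight line is the simplest possible model. Real usage is often bursty or seasonal, so treat the result as a rough early warning rather than a precise date.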
Monitoring tells you if a system is working, and can show trends and unusual patterns of usage or performance. Observability looks to answer the question of why it is happening.
Implementing observability builds on top of monitoring. It helps get you in front of potential issues. This saves time and money because you can detect a problem and fix it before development is affected. Then, looking at the big picture, you diagnose and implement improvements for the future.
Three Pillars of Observability
Observability enables you to be proactive, not reactive. It is an attribute at the intersection of monitoring, analytics, and alert management. There are three pillars of observability:
- Logs.
- Metrics.
- Traces.
Together, these three ideas provide a holistic view of your system.
To debug or solve problems, you need access to the right information. Server logs contain the information you need to diagnose an issue. Plus logs can track the history of changes, which helps with audits and compliance.
Logs are easy to generate, but they can be expensive to manage and process. For large organizations, it can be difficult to search through logs at scale. Tailing logs is often not enough, and it can be hard to orchestrate across multiple servers and applications, for example when you need to compare an application log with a web server log.
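To illustrate why looking at one log at a time falls short, here is a small sketch that merges two logs into a single timeline ordered by timestamp. The log lines, the ISO timestamp format, and the `parse` helper are invented for this example. Real application and web server logs use their own formats and would need their own parsers.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, Tuple

Entry = Tuple[datetime, str, str]  # (timestamp, source, message)


def parse(source: str, lines: Iterable[str]) -> Iterator[Entry]:
    """Split 'TIMESTAMP message' lines and tag each entry with its source."""
    for line in lines:
        stamp, _, message = line.partition(" ")
        yield datetime.fromisoformat(stamp), source, message


# Invented sample data. Each log is already sorted by time, as logs usually are.
app_log = [
    "2024-05-01T10:00:01 commit started",
    "2024-05-01T10:00:04 commit failed: disk full",
]
web_log = [
    "2024-05-01T10:00:00 POST /submit",
    "2024-05-01T10:00:05 500 /submit",
]

# heapq.merge lazily interleaves already-sorted inputs into one sorted stream.
merged = list(heapq.merge(parse("app", app_log), parse("web", web_log)))

for stamp, source, message in merged:
    print(stamp.isoformat(), f"[{source}]", message)
```

Seen together, the web server error lines up with the application failure a second earlier, which is the kind of connection that is easy to miss when tailing each file separately.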
Logs contain a lot of information, and can quickly grow large. To help manage this information, logs need to be rotated regularly. This helps avoid issues such as logs consuming too much disk space on your server.
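As one way to picture rotation, the sketch below uses `RotatingFileHandler` from Python's standard library, which starts a new file once the current one reaches a size limit and keeps only a fixed number of old files. The logger name, size limit, and backup count are arbitrary values chosen for the demo, and the helper `build_rotating_logger` is hypothetical. Many servers rely on an external tool or their own built-in rotation instead.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_rotating_logger(name: str, path: str, max_bytes: int, backups: int) -> logging.Logger:
    """Create a logger that rotates `path` once it reaches about `max_bytes`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep demo output out of the root logger
    return logger


# Demo in a temporary directory with a deliberately tiny size limit.
with tempfile.TemporaryDirectory() as tmp:
    log = build_rotating_logger(
        "demo.rotation", os.path.join(tmp, "app.log"), max_bytes=200, backups=2
    )
    for i in range(100):
        log.info("message number %d", i)

    # Close the file before the directory is cleaned up.
    for handler in list(log.handlers):
        handler.close()
        log.removeHandler(handler)

    rotated_files = sorted(os.listdir(tmp))

print(rotated_files)
```

However many messages are written, disk usage stays bounded at roughly `max_bytes` times `backups + 1`, because the oldest file is discarded on each rotation.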
While logs are necessary to help diagnose issues, log analysis on its own will not always be sufficient to alert you when something goes wrong.
Metrics are all about the numbers. Each metric looks at specific data over time to help understand past trends and events, and what is happening now. The information provided by metrics can be used to predict what will happen in the near future.
Observability Work Metrics
- Throughput: The amount of work a server is doing.
- Success: A percentage of work that is successfully executed.
- Error: The number of errors, or rate of errors.
- Performance: Quantifies how efficiently servers are completing work. For example, the number of requests done over a specific time.
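The work metrics above can be computed from a list of completed requests. The sketch below is illustrative only: the `Request` and `WorkMetrics` types, the field names, and the sample values are assumptions for the example, and mean duration is used as just one possible stand-in for performance.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    duration_s: float  # how long the request took
    ok: bool           # whether it completed successfully


@dataclass(frozen=True)
class WorkMetrics:
    throughput_per_s: float  # requests completed per second
    success_pct: float       # percentage of requests that succeeded
    error_count: int         # number of failed requests
    mean_duration_s: float   # average time per request


def summarize(requests: Sequence[Request], window_s: float) -> WorkMetrics:
    """Summarize the requests observed during a window of `window_s` seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    total = len(requests)
    if total == 0:
        return WorkMetrics(0.0, 0.0, 0, 0.0)
    succeeded = sum(1 for r in requests if r.ok)
    return WorkMetrics(
        throughput_per_s=total / window_s,
        success_pct=100.0 * succeeded / total,
        error_count=total - succeeded,
        mean_duration_s=sum(r.duration_s for r in requests) / total,
    )


# Hypothetical sample: four requests seen in a two-second window.
sample = [
    Request(0.25, True),
    Request(0.5, True),
    Request(0.75, False),
    Request(0.5, True),
]
print(summarize(sample, window_s=2.0))
```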
Observability Resource Metrics
- Utilization: The amount of capacity being used as a percentage of time.
- Saturation: The amount of requested work that cannot be completed due to limited resources, such as available disk space or network bandwidth.
- Errors: Internal application or tool errors.
- Availability: The percentage of time that the server is available to handle requests.
These metrics, and others, can be used to build a dashboard for real-time monitoring. They can also be used to trigger alerts. The frequency of checking depends on your topology and development organization’s needs.
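Here is a minimal sketch of how a resource metric might trigger an alert. It reads disk utilization with the standard library's `shutil.disk_usage` and compares it against two thresholds. The 80% and 90% thresholds and the level names are assumptions chosen for the example, not recommended values.

```python
import shutil


def utilization_pct(used: float, total: float) -> float:
    """Return used capacity as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def alert_level(value_pct: float, warn_pct: float = 80.0, crit_pct: float = 90.0) -> str:
    """Map a utilization percentage to 'ok', 'warning', or 'critical'."""
    if value_pct >= crit_pct:
        return "critical"
    if value_pct >= warn_pct:
        return "warning"
    return "ok"


# Check the disk that holds the current working directory.
usage = shutil.disk_usage(".")
disk_pct = utilization_pct(usage.used, usage.total)
print(f"disk utilization {disk_pct:.1f}% -> {alert_level(disk_pct)}")
```

In practice a check like this would run on a schedule, and the result would be sent to a dashboard or paging tool rather than printed.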
And unlike logs, metrics can usually be kept around longer because they typically take up less space. Each metric is a fixed-size sample taken at a regular interval, so the cost to store metric data does not grow with traffic the way log volume does.
Logs and metrics can give you observability, but it is usually just about a specific server or part of a larger system. Traceability is the link that goes across several systems to help diagnose larger issues.
Tracing follows a series of related events that show a server request from end to end. Traces can help you with live debugging.
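The idea behind a trace can be sketched in a few lines: every step of a request records a span, and all spans for the same request share one trace ID and point back to their parent. The `Tracer` and `Span` classes and the step names below are hypothetical and run in a single process. Real tracing systems also pass the trace ID between servers so that spans from different machines can be joined.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this step
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float
    end: Optional[float] = None


class Tracer:
    """Toy in-process tracer that tracks the active span with a stack."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        current = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
            start=time.perf_counter(),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("handle_request") as root:
    with tracer.span("check_permissions"):
        pass
    with tracer.span("read_files"):
        pass

for s in tracer.finished:
    print(s.name, "trace:", s.trace_id[:8], "parent:", s.parent_id)
```

Because each span carries timing and a parent, the finished spans can be reassembled into a tree that shows where a slow request spent its time.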
Helix Core and Observability
Helix Core is often used in diverse topologies, with many servers acting as replicas. Helix Core quickly delivers assets to contributors in development centers and studios around the globe.
When it comes to observability, Helix Core provides logging features that include the raw data you need. Logging can be combined with client commands to automate workflows.
Your Observability Tool Stack
You’re going to struggle with observability if you don’t have the right tools. Picking a powerful combination of observability tools is vital. There are a lot of options to choose from when it comes to observability.
Many Helix Core administrators have found success using combinations of Elastic Stack (formerly ELK), Splunk, Nagios, CheckMK, Prometheus, and Grafana, among many other tools. Integrations exist to feed Helix Core logs and other information into these tools.
In our next blog, we will give you the code to implement monitoring and observability. Using open source tools and Helix Core, you will get access to a demo dashboard that illustrates how to:
- Collect and use logs.
- Access the metrics and traceability data you need.