Use of Analytics Leads to Helix Core Customer Improving Server Performance
Analytics and continuous monitoring are becoming increasingly important strategies for DevOps teams. Monitoring version control system (VCS) server speed and other operational metrics is critically important in large installations, as it allows developers to uncover bottlenecks and performance issues.
A use case from one of our larger Helix Core users underscores the importance of monitoring server performance. This team runs a massive server with 40 cores and a terabyte of memory to serve hundreds of users and hundreds of terabytes of data. The team noticed that users were experiencing slowdowns, even when running a humble ‘p4 fstat’ command, and asked the Perforce professional services team to look into the matter.
Fortunately, the DevOps team already had a range of monitoring tools in operation on this server. One of these tools, MY-NETDATA.IO, provides real-time analytics through an interactive dashboard and gave us a good overview of the state of the underlying server hardware and OS.
Analyzing performance issues often comes down to identifying the bottlenecks and addressing them in order. In this case, the bottleneck turned out to be the CPU. This is unusual because Helix Core is not normally CPU-bound; we would expect a server that is shifting hundreds of gigabytes of data to be IO-bound in some way.
The customer runs Helix Core on Linux, and we suspected the problem to be Transparent Huge Pages (THP), a feature the operating system offers to support processes with large memory requirements.
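To check whether THP is active on a Linux host, you can read the kernel's sysfs control file, which marks the current mode in square brackets (for example, "always madvise [never]"). The following is a minimal sketch, assuming the mainline kernel path /sys/kernel/mm/transparent_hugepage/enabled; the function names are hypothetical and are not part of any Perforce tooling.

```python
from pathlib import Path

# Assumption: standard sysfs location on mainline Linux kernels.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the active mode from text such as 'always madvise [never]'."""
    for token in text.split():
        # The kernel wraps the currently selected mode in square brackets.
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active THP mode found in {text!r}")


def read_thp_mode(path: Path = THP_ENABLED) -> str:
    """Read and parse the THP mode from sysfs.

    Raises FileNotFoundError if the control file is absent,
    for example when the kernel was built without THP support.
    """
    return parse_thp_mode(path.read_text())
```

Calling read_thp_mode() on the server returns "always", "madvise", or "never". A result of "always" means THP is applied to all eligible memory, which is the setting we suspected in this case.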
A Helix Core server itself typically does not consume large amounts of memory; instead, it relies on the file system to cache as many of the accessed database pages and depot files as possible. Under high load, the server has to process hundreds of requests a second, and that led to the operating system busily reclaiming memory, thus locking up the kernel. This was easily demonstrated by looking at the graph below: the blue area represents the system space (kernel) CPU usage, while the yellow area represents user space CPU usage.
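If no dashboard is available, the same user versus system split can be approximated from two samples of /proc/stat. The sketch below is illustrative only, not how the monitoring tool computes its chart. It assumes the standard field order of the aggregate "cpu" line, counts user plus nice time as user space and the system field alone as kernel time, and uses hypothetical function names.

```python
def parse_cpu_line(stat_text: str) -> list[int]:
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # Skip per-core lines such as 'cpu0'; only the total is wanted.
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_split(before: list[int], after: list[int]) -> tuple[float, float]:
    """Return (user %, system %) of CPU time spent between two samples."""
    delta = [b - a for a, b in zip(before, after)]
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal,
    # guest, guest_nice. Guest time is already counted inside user and nice,
    # so only the first eight fields are summed for the total.
    total = sum(delta[:8])
    if total <= 0:
        return 0.0, 0.0
    user = delta[0] + delta[1]
    system = delta[2]
    return 100.0 * user / total, 100.0 * system / total
```

To use it, read /proc/stat, wait a second or more, read it again, and pass both parsed samples to cpu_split. A system share that stays well above the user share on a busy server is the pattern described above.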
You can see the dramatic drop in system/kernel usage in this graph from the monitoring tool dashboard. This happened halfway through the time period shown, which was when the team disabled THP.
Fortunately, THP can be disabled without having to shut down the server or even the Helix Core processes. With system space usage thus reduced, more processing power is available for the Helix Core server itself, thereby improving throughput and preventing CPU utilization spikes.
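Disabling THP at runtime amounts to writing "never" to the kernel's sysfs controls. The sketch below shows one way to do that, assuming the mainline path /sys/kernel/mm/transparent_hugepage and root privileges. Also setting the defrag control is an assumption on our part rather than something this case study confirms, and the function name is hypothetical.

```python
from pathlib import Path

# Assumption: standard sysfs directory on mainline Linux kernels.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def disable_thp(thp_dir: Path = THP_DIR) -> None:
    """Write 'never' to the THP 'enabled' and 'defrag' controls.

    Must run as root on a real system. Controls that do not exist
    on the running kernel are skipped.
    """
    for name in ("enabled", "defrag"):
        control = thp_dir / name
        if control.exists():
            control.write_text("never")
```

A change made this way takes effect immediately but does not survive a reboot, so it also needs to be applied at boot time through your distribution's usual mechanism.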
We are now working with the team to identify and address other potential bottlenecks, but this one quick configuration change has already delivered significant performance improvements.
Even though this isn’t the first time we have written about THP support on the blog, this scenario struck us as a compelling reminder of the importance of monitoring your server with analytics tools. With the insight the monitoring data provided, we were quickly able to disable the THP feature on Linux and improve the performance of the customer’s Helix Core server.
For more information on transparent huge page support in Linux, visit the Perforce Knowledge Base.
We’ll be back soon (here and on the Perforce Knowledge Base) with more valuable information on how to improve performance, prevent bottlenecks, and make the best use of your Helix Core servers.