The CPU overload we dealt with sounds like the opening of a technical horror movie in which the antagonist, instead of stalking suburban families, takes aim at a database. There was no warning: in an instant, the key application supporting our client's sales started running so slowly that it became impossible to log into it. A 28 TB database became unavailable, and the entire production warehouse froze like a GTX 480 on which someone had just tried to run The Witcher 3.
A similar story could happen in any company, and it underscores how suddenly and unexpectedly performance issues can hit a database, paralyzing key business operations. This case is an excellent example of how important it is not only to react quickly to crises, but above all to anticipate and prevent them.
*As a reputable provider of database solutions, we approach every case with a full sense of responsibility. Protecting the privacy of our customers' data is a priority for us, so all sensitive information in this article has been anonymized. This keeps our customers safe while allowing us to present this interesting case to a wider audience of professionals.
The moment we realized that our client's database problem was serious enough to halt their business entirely, restoring normal operation became our top priority.
100% CPU load
Our first step was to identify the source of the downtime. A preliminary diagnosis, carried out with our monitoring applications, showed that at the critical moment the CPU was being used at 100%. This can be seen in the graph above, where the fault area is marked in yellow.
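The article does not describe the monitoring stack, so the following is only a rough sketch of the kind of check involved: sampling host CPU utilization (here with the psutil library) and alerting when it stays pinned near 100%. The threshold and intervals are arbitrary assumptions, not part of the client's setup.

```python
import psutil  # pip install psutil

# Assumed values for illustration only; the real monitoring stack is not described in the article.
THRESHOLD_PCT = 95.0     # treat utilization above this as "saturated"
SAMPLE_INTERVAL_S = 5    # seconds between samples (cpu_percent blocks for this long)
SUSTAINED_SAMPLES = 12   # ~1 minute of sustained saturation before alerting

def watch_cpu() -> None:
    """Alert when CPU utilization stays above the threshold for a sustained period."""
    streak = 0
    while True:
        usage = psutil.cpu_percent(interval=SAMPLE_INTERVAL_S)  # average over the interval
        streak = streak + 1 if usage >= THRESHOLD_PCT else 0
        if streak >= SUSTAINED_SAMPLES:
            print(f"ALERT: CPU at {usage:.0f}% for roughly {streak * SAMPLE_INTERVAL_S} seconds")

if __name__ == "__main__":
    watch_cpu()
```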
Within the IT team responsible for solving the problem, different members saw the situation differently:
There were two options on the table. The first was to solve the problem by adding resources; with a whopping 28 TB of data and 32 high-performance processors already in place, that meant considerable expense. The second, more insightful option was to understand the root of the problem first.
We chose the second path. It was not an easy decision, but the team agreed that only by understanding the core of the problem could a lasting solution be found, one that would not only remedy the current crisis but also protect the system from similar challenges in the future. As it turned out later, it was the right decision: adding CPU power, expensive as it would have been, would not have solved the problem even partially.
We had already concluded that CPU saturation was the source of the problem, but that was only the beginning of the road. The next stage was to understand what exactly was responsible for such high CPU consumption. Analyzing the CPU time graphs allowed us to quickly narrow down the circle of suspects.
In the graph below, the red line represents the overall CPU time, and the yellow fill indicates the CPU time consumed by a single query:
Overall CPU time vs. the CPU time of a single query
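The monitoring tool and database engine are not named in the article, so purely as a generic sketch: given samples of total CPU time and the CPU time attributed to one suspect query (the CpuSample structure and the numbers below are hypothetical), the query's share of the load can be computed like this.

```python
from dataclasses import dataclass

@dataclass
class CpuSample:
    """One monitoring sample; a hypothetical structure, for illustration only."""
    timestamp: str        # e.g. "2024-01-21 12:00"
    total_cpu_ms: float   # total CPU time consumed in the sampling window
    query_cpu_ms: float   # CPU time attributed to the suspect query in the same window

def query_cpu_share(samples: list[CpuSample]) -> list[tuple[str, float]]:
    """Return (timestamp, share) pairs: the fraction of total CPU time taken by the query."""
    return [
        (s.timestamp, s.query_cpu_ms / s.total_cpu_ms if s.total_cpu_ms else 0.0)
        for s in samples
    ]

# Example with made-up numbers: the query goes from a minor consumer to dominating the CPU.
samples = [
    CpuSample("2024-01-20 12:00", total_cpu_ms=60_000, query_cpu_ms=4_000),
    CpuSample("2024-01-21 12:00", total_cpu_ms=60_000, query_cpu_ms=55_000),
]
for ts, share in query_cpu_share(samples):
    print(f"{ts}: query used {share:.0%} of CPU time")
```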
We then identified a specific query that had suddenly begun to load the CPU extremely heavily.
A query that saturates the CPU
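How such a query is singled out depends on the engine, which the article never names. Assuming, for illustration only, a SQL Server instance, the cached-plan statistics in sys.dm_exec_query_stats can be ranked by CPU time from a small pyodbc script; the connection string and host name below are placeholders.

```python
import pyodbc  # pip install pyodbc

# Assumption: a SQL Server instance reachable at a placeholder host; not the client's real setup.
CONN_STR = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=example-host;Trusted_Connection=yes"

# Aggregated statistics for cached plans; total_worker_time (CPU time) is in microseconds.
TOP_CPU_QUERIES_SQL = """
SELECT TOP (10)
       qs.total_worker_time / 1000               AS total_cpu_ms,
       qs.execution_count,
       qs.total_worker_time / qs.execution_count AS avg_cpu_us_per_exec,
       st.text                                   AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
"""

def print_top_cpu_queries() -> None:
    """List the ten statements that have consumed the most CPU since they were cached."""
    with pyodbc.connect(CONN_STR) as conn:
        for row in conn.execute(TOP_CPU_QUERIES_SQL):
            print(f"{row.total_cpu_ms:>12} ms CPU | {row.execution_count:>12} runs | "
                  f"{row.avg_cpu_us_per_exec:>8} us/run | {row.query_text[:60]!r}")

if __name__ == "__main__":
    print_top_cpu_queries()
```

The same idea applies to other engines, for example pg_stat_statements in PostgreSQL; only the catalog views differ.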
This raised a question: was the query that caused the CPU load new to the system, or had it been present earlier, but on a much smaller scale? The answer came from a graph showing how many times the query was run over the course of January 2024. The data was shocking: on January 21 and 22, the number of executions jumped from roughly 3.5 million per day to more than 100 million. Such an unexpected increase was the obvious source of the performance failure:
Graph showing the number of times a problematic query was run
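A jump from roughly 3.5 million to over 100 million executions per day is easy to spot on a chart, but the same check can be automated once daily execution counts are collected. A minimal sketch, with illustrative numbers rather than the client's actual data:

```python
from statistics import median

# Illustrative daily execution counts for one query (date -> executions); not the client's real data.
daily_executions = {
    "2024-01-18": 3_400_000,
    "2024-01-19": 3_600_000,
    "2024-01-20": 3_500_000,
    "2024-01-21": 104_000_000,
    "2024-01-22": 112_000_000,
}

def find_spikes(counts: dict[str, int], factor: float = 5.0) -> list[str]:
    """Return the days whose execution count exceeds `factor` times the median of all days."""
    baseline = median(counts.values())
    return [day for day, n in counts.items() if n > factor * baseline]

for day in find_spikes(daily_executions):
    print(f"{day}: execution count far above the usual baseline")
```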
Understanding the scale of the problem quickly took our investigation beyond the database itself. It became clear that the solution would not come from query optimization or from adding computing power to the servers. Instead, we had to turn to the team responsible for developing the application and find out what changes could have caused such a drastic increase in the number of query executions. The question concerned not only the technical aspects of the change, but also its business rationale: it was obvious that such a sudden jump in activity could not have been caused by a natural increase in market demand.
This experience, like many others, underscores the importance of a holistic approach to technical problems. From the technical details, through the infrastructure, to the business implications, the different perspectives within an IT team all contribute to a deeper understanding of the situation.
We are often tempted to solve problems by investing in hardware; money for such expenses can almost always be found. It is worth remembering, however, that this usually treats the symptoms rather than the disease, and increasing hardware resources is rarely the best solution.