Domino effect in T10 database on g1rush Server

Microsoft SQL
Oracle
27/6/2024
Tomasz Chwasewicz
Table of contents

Performance problems rarely occur in isolation. Often, a single problem can trigger a cascade of related problems, each of which amplifies the overall impact on the system. This case study analyzes such a scenario in the T10 database on g1rush Server.

Identifying the initial problem: Temporary tablespace overflow

On June 21, 2024 at 13:04:32, DBPLUS PERFORMANCE MONITOR flagged a critical error: ORA-1652, indicating the inability to extend the temporary segment by 128 in the TEMPORARY_DATA tablespace in the T10 database on g1rush Server. This error occurred due to simultaneous sessions overwhelming the temporary space of tables, consuming more than 30 GB.

g1rush Server - Identyfikacja początkowego problemu

Highlighted in the chart, the spike in resource utilization immediately caught the company's attention. This overflow interfered with ongoing operations and indicated inefficient resource management.

Secondary problem: High CPU load

Further investigation revealed a persistent high load on the processor, even after reducing the number of active processes. The cyclical nature of this load, peaking during periods of high activity, suggested deeper problems with query optimization and resource allocation.

g1rush Server - Wysokie obciążenie procesora

The graph above shows busy time (CPU), system time (CPU), and user time (CPU) with a clear pattern of cyclic peaks. This persistent high load required us to delve into specific queries and processes that consume CPU resources.

Detailed query analysis: Execution time and plan changes

The next step was to analyze the specific queries contributing to the high CPU load. In particular, the query for the identifier 3445255751 showed significant changes in execution time and schedule. It was suspected that these changes were related to the increased execution time of another query, ID 904000402, due to the use of the shared function.

g1rush Server - Szczegółowa analiza zapytań: Czas wykonania i zmiany planu

The analysis showed that query 3445255751 experienced run-time fluctuations, which had an impact on overall performance. Comparing execution plans, we identified plan hash 232983816 as a potential cause of inefficiency.

This comparison highlighted the need to stabilize query performance by forcing a specific plan. The change in the implementation strategy suggested the need for consistent monitoring and timely adjustment of plans.

Network load

As we continued our investigation, we noticed significant network load, with data transfers as high as 52GB per hour. This indicated a significant flow of data, which may have exacerbated the observed performance problems.

Obciążenie sieci - g1rush Server

Analysis of network traffic showed a large number of bytes received and sent by SQL*net, highlighting the impact of data transfer on overall performance.

Related articles