Monday morning. The application started throwing "Too many connections" errors at 09:12. The on-call engineer raised max_connections from 500 to 1,000. The errors continued. At 09:45 they raised it to 2,000. By 10:00 the server was OOM-killed. More connections had made the problem worse.
The Alert
Application error: ERROR 1040 (HY000): Too many connections. At peak, 500 connections were exhausted within 2 minutes of application startup. The application connection pool had been configured without a wait timeout — connections that could not establish would retry indefinitely, creating a flood of connection attempts against the already-exhausted pool.
First Hypothesis: Not Enough Connections
The immediate response — raise max_connections — was wrong. We found out why when the server ran out of memory. Each MySQL connection reserves memory: thread stack, sort buffer, join buffer, read buffer. At 2,000 connections, the theoretical maximum memory was 2,000 × ~8MB = 16GB on a 16GB server. The OOM killer ran before any queries did.
The Discovery
SELECT user, host, db, command, time AS seconds, state, LEFT(info, 80) AS query_preview FROM information_schema.processlist ORDER BY command, time DESC LIMIT 50;
Of the 500 connections: 487 were in state Sleep — idle connections held by the application's connection pool across 24 application instances, each with pool_size=25. Of the 13 active connections, 11 were waiting on the same lock. One connection had held an exclusive lock on the products table for 340 seconds — an admin script running a bulk UPDATE had been left running in a transaction from an overnight job that had been forgotten.
Incident Timeline
| Time | Event |
|---|---|
| Sat 23:00 | Overnight admin script starts bulk UPDATE on products table in a transaction |
| Sat 23:15 | Admin closes laptop. Transaction left open. Lock held. |
| Mon 09:00 | Monday morning traffic begins. Application instances start. Pool fills. |
| 09:12 | New queries queue behind the lock. Connection pool exhausted waiting. |
| 09:12 | "Too many connections" errors begin. |
| 09:14 | max_connections raised to 1,000. More connections accepted, all queue behind lock. |
| 09:45 | max_connections raised to 2,000. OOM kill at 10:00. |
| 10:15 | Server restarts. Admin script transaction rolled back automatically. |
| 10:16 | Lock released. Connections serve normally. |
Root Cause
A forgotten open transaction held an exclusive lock on a high-contention table for 34 hours. Application connections queued behind the lock, exhausting the connection pool. Raising max_connections allowed more connections to queue — consuming more memory — without resolving the underlying lock. The correct response was to identify and kill the blocking transaction, not increase the connection limit.
The Fix
-- Find long-running transactions SELECT trx_id, trx_state, trx_started, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_sec, trx_mysql_thread_id AS connection_id, trx_rows_locked, trx_rows_modified FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 10; -- Kill the offender: KILL [connection_id]; -- Set wait_timeout to auto-kill idle transactions: -- SET GLOBAL wait_timeout = 3600; -- 1 hour max idle
Prevention
wait_timeout is now set to 3600 seconds globally — no connection can idle for more than an hour. Admin scripts that use transactions run in a wrapper that sets a session-level lock timeout. The connection pool dashboard was added to the main monitoring board — connection count by state, not just total count.