Resolving a 7-hour HikariPool exhaustion in production

Context

A multi-branch POS SaaS running on ECS Fargate (Java / Spring Boot) with RDS MySQL. Three services share the database through HikariCP connection pools: gateway, auth, cash-register.

One afternoon, the cash-register service stopped accepting requests. The logs repeated a single message: HikariPool-1 - Connection is not available, request timed out. By the time we identified the cause, 23K errors had accumulated and the service had been degraded for ~7 hours.

The cascade that hid the root cause

The database itself was operating normally. CloudWatch RDS metrics showed CPU at normal levels, connections well below max_connections, and query latency unchanged.

The missing layer was connection-pool-level metrics. The available visibility covered the database side, not the internal state of each service’s pool. The symptom (timeouts) appeared to be database-related, but the actual cause (leaks inside the application) wasn’t visible from RDS metrics.
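One way to get at that internal pool state is through JMX. The sketch below is a minimal example, assuming the pool was started with HikariCP's registerMbeans option enabled and uses the default pool name HikariPool-1 seen in the logs. The PoolGauge class and its read method are hypothetical names for illustration. When no pool is registered in the JVM, the reads return an empty Optional.

```java
import java.lang.management.ManagementFactory;
import java.util.Optional;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper: reads one integer gauge from a HikariCP pool MBean.
final class PoolGauge {

    static Optional<Integer> read(String poolName, String attribute) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Object name format HikariCP uses when registerMbeans is enabled.
            ObjectName name = new ObjectName("com.zaxxer.hikari:type=Pool (" + poolName + ")");
            if (!server.isRegistered(name)) {
                return Optional.empty(); // no such pool in this JVM, or MBeans are off
            }
            return Optional.of((Integer) server.getAttribute(name, attribute));
        } catch (JMException e) {
            return Optional.empty(); // malformed name or unknown attribute
        }
    }

    public static void main(String[] args) {
        String[] gauges = {
            "ActiveConnections", "IdleConnections", "TotalConnections", "ThreadsAwaitingConnection"
        };
        for (String gauge : gauges) {
            System.out.println(gauge + " = " + read("HikariPool-1", gauge));
        }
    }
}
```

A rising ThreadsAwaitingConnection value while ActiveConnections sits at the pool maximum is the pool-side signal that RDS metrics cannot show.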

The cause, identified through thread dumps:

55 services in the codebase used transactional methods, several with OSIV (Open Session In View) enabled — the Spring pattern that keeps a Hibernate session open during the entire request, including view rendering.
Several controllers were calling external APIs inside transactions, holding connections for seconds during the HTTP call.
Under load, connections were checked out faster than they were returned, and the pool drained. Once it was exhausted, every new request waited for a connection until it timed out.
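The drain follows from Little's law: the average number of connections in use equals the request rate multiplied by the time each request holds a connection. The sketch below uses hypothetical numbers (40 requests/s, a 50 ms query, a 2 s hold when an external call runs inside the transaction, and a pool of 50). None of these figures come from the incident. They only show how the hold time drives demand.

```java
// Hypothetical worked example of connection demand under Little's law.
final class PoolDemand {

    // Average connections in use = arrival rate (req/s) x hold time (s).
    static double connectionsInUse(int requestsPerSecond, int holdMillis) {
        return requestsPerSecond * holdMillis / 1000.0;
    }

    public static void main(String[] args) {
        int poolSize = 50;          // assumed pool size
        int requestsPerSecond = 40; // assumed load

        double queryOnly = connectionsInUse(requestsPerSecond, 50);          // 2.0
        double withExternalCall = connectionsInUse(requestsPerSecond, 2000); // 80.0

        System.out.println("query only: " + queryOnly + " of " + poolSize);
        System.out.println("with external call: " + withExternalCall + " of " + poolSize);
        System.out.println("pool exhausted: " + (withExternalCall > poolSize));
    }
}
```

With the same traffic, holding the connection across the HTTP call raises demand from 2 connections to 80, which is more than the assumed pool can supply.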

The fix

OSIV disabled at the application level — sessions live only inside @Transactional boundaries.
55 services audited for transaction scope; external calls moved outside the transactional block.
HikariCP pool sizing rebalanced across the three services based on actual concurrency profiles (gateway 300, auth 150, worker 50; 500 in total out of a 2,730 maximum, about 18% utilization, which leaves roughly 82% headroom).
Connection leak detection turned on in Hikari config (leakDetectionThreshold) so future leaks log a stack trace immediately.
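In Spring Boot, the OSIV, pool-size and leak-detection items map to configuration properties. The fragment below is a minimal sketch, assuming the standard spring.jpa and spring.datasource.hikari property prefixes. The pool size is the worker figure from the list above. The leak-detection threshold is an illustrative value, not the one used in the incident.

```properties
# Disable Open Session In View so sessions live only inside @Transactional boundaries.
spring.jpa.open-in-view=false

# Per-service pool size (the worker figure from the text).
spring.datasource.hikari.maximum-pool-size=50

# Log a stack trace when a connection stays checked out longer than this (ms).
# Hypothetical value; HikariCP rejects thresholds below 2000 ms.
spring.datasource.hikari.leak-detection-threshold=10000

# How long a caller waits for a connection before "request timed out" (ms).
# 30000 is the HikariCP default.
spring.datasource.hikari.connection-timeout=30000
```

The leak-detection threshold should sit above the slowest legitimate transaction. Otherwise normal requests are reported as leaks.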

Result: connection leaks dropped from 137/h to 21/h (-85%) within 48 hours of deploy.

Observability improvements

To prevent recurrence and surface similar patterns before they reach production, I added:

CloudWatch metric filters on log lines matching HikariPool.*timed out, Connection is not available, and pool stats: — alarm configured at rate > 5/min.
CloudWatch Insights queries saved as named runbook entries (pool-saturation, connection-leaks, slow-transactions).
15+ new CloudWatch Alarms organized by category: memory pressure, connection saturation, task health, RDS replication lag — each with a one-line runbook attached to the alarm description.
Container Insights enabled on ECS Fargate for real-time CPU, memory and connection visibility per task.
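The log patterns and the rate threshold from the first item can be checked against sample log lines before they are wired into CloudWatch. The sketch below is a local stand-in: it uses Java regular expressions, not CloudWatch filter-pattern syntax. The PoolLogAlarm class and its sample lines are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical local check of the alarm rule: more than 5 matching lines per minute.
final class PoolLogAlarm {

    // The three patterns named in the text, combined into one expression.
    static final Pattern POOL_ERROR =
        Pattern.compile("HikariPool.*timed out|Connection is not available|pool stats:");

    static final int THRESHOLD_PER_MINUTE = 5;

    // Counts the lines that match any of the patterns.
    static long countMatches(List<String> lines) {
        return lines.stream().filter(line -> POOL_ERROR.matcher(line).find()).count();
    }

    // Takes the log lines from a single one-minute window.
    static boolean shouldAlarm(List<String> oneMinuteOfLogs) {
        return countMatches(oneMinuteOfLogs) > THRESHOLD_PER_MINUTE;
    }

    public static void main(String[] args) {
        String hit = "HikariPool-1 - Connection is not available, request timed out after 30000ms.";
        List<String> window = List.of(hit, hit, hit, hit, hit, hit, "GET /health 200");
        System.out.println("matches: " + countMatches(window));
        System.out.println("alarm: " + shouldAlarm(window));
    }
}
```

Running real log samples through a check like this helps confirm that the patterns match the exact message text the service emits.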

When a similar pattern appeared in sandbox five days later during the OSIV rollout on a different service, the alarm fired in under 2 minutes and the issue was caught before it reached production traffic.

Outcome

The incident was resolved and the patterns that caused it were addressed at the application and observability layers. Pool-level visibility now exists where it didn’t before, with alarms and runbooks in place for the categories of failure observed during the incident.