Hospitality SaaS (multi-branch POS) · 2024

Resolving a 7-hour HikariPool exhaustion in production

Anatomy of an outage: 23K errors, the cascade that hid the root cause, and the observability that came out of fixing it.

Role: Solutions Architect & DevOps Lead
Outage duration: ~7h 25min
Errors logged: 23,500+
Connection leak: 137/h → 21/h (-85%)
Alarms added: 15+
AWS · ECS / Fargate · RDS MySQL · Java / Spring Boot · HikariCP · CloudWatch

Context

A multi-branch POS SaaS running on ECS Fargate (Java / Spring Boot) with RDS MySQL. Three services share the database through HikariCP connection pools: gateway, auth, cash-register.

One afternoon, the cash-register service stopped accepting requests. The logs showed one repeating line: "HikariPool-1 - Connection is not available, request timed out". By the time we identified the cause, more than 23,500 errors had accumulated and the service had been degraded for ~7h 25min.

The cascade that hid the root cause

The database itself was operating normally. CloudWatch RDS metrics showed CPU at normal levels, connections well below max_connections, and query latency unchanged.

The missing layer was connection-pool-level metrics. The available visibility covered the database side, not the internal state of each service’s pool. The symptom (timeouts) appeared to be database-related, but the actual cause (leaks inside the application) wasn’t visible from RDS metrics.

The cause, identified through thread dumps:

  • 55 services in the codebase used transactional methods, several with OSIV (Open Session In View) enabled — the Spring pattern that keeps a Hibernate session open during the entire request, including view rendering.
  • Several controllers were calling external APIs inside transactions, holding connections for seconds during the HTTP call.
  • Under load, the pool drained faster than connections returned. Once it was exhausted, every new request waited in the queue until it timed out.

[Figure: what the dashboards showed versus what was actually happening. RDS MySQL metrics (CPU normal, connections far below max_connections, latency flat) suggested the database was fine, and no metric covered the state of each service’s HikariCP pool (gateway: 300, auth: 150, cash-register: leaking). Leak path, from the 55 services audited: @Transactional methods holding a connection for seconds during an external API call, plus OSIV keeping the session open through rendering, ending in pool exhaustion and "HikariPool-1 - Connection is not available" timeouts.]
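
The leak path described above can be reproduced without a database. The sketch below is not the production code: it models the connection pool as a java.util.concurrent.Semaphore and replaces both the database work and the external HTTP call with Thread.sleep. The class name, method name, and all timings are hypothetical and chosen only to make the effect visible.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model: a Semaphore stands in for the connection pool,
// Thread.sleep stands in for database work and for the external HTTP call.
class PoolExhaustionSketch {

    /** Returns how many of the concurrent requests timed out waiting for a connection. */
    static int timedOutRequests(int poolSize, int requests, long acquireTimeoutMs,
                                long dbWorkMs, long externalCallMs,
                                boolean holdDuringExternalCall) {
        Semaphore pool = new Semaphore(poolSize);
        AtomicInteger timeouts = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(requests);
        ExecutorService executor = Executors.newFixedThreadPool(requests);
        try {
            for (int i = 0; i < requests; i++) {
                executor.submit(() -> {
                    try {
                        if (!holdDuringExternalCall) {
                            Thread.sleep(externalCallMs); // call the API before borrowing
                        }
                        if (!pool.tryAcquire(acquireTimeoutMs, TimeUnit.MILLISECONDS)) {
                            timeouts.incrementAndGet(); // the "Connection is not available" case
                            return;
                        }
                        try {
                            Thread.sleep(dbWorkMs);
                            if (holdDuringExternalCall) {
                                Thread.sleep(externalCallMs); // connection idle but still held
                            }
                        } finally {
                            pool.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        done.countDown();
                    }
                });
            }
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for requests", e);
        } finally {
            executor.shutdown();
        }
        return timeouts.get();
    }

    public static void main(String[] args) {
        // Pool of 2, 8 concurrent requests, 200 ms acquire timeout,
        // 5 ms of database work, 400 ms external call.
        System.out.println("held during call: " + timedOutRequests(2, 8, 200, 5, 400, true));
        System.out.println("released before:  " + timedOutRequests(2, 8, 200, 5, 400, false));
    }
}
```

With these illustrative timings, the variant that holds a permit across the simulated external call should report timeouts, while the variant that makes the call before borrowing should report none. The database work is identical in both runs; only the time each connection stays checked out changes.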

The fix

  1. OSIV disabled at the application level — sessions live only inside @Transactional boundaries.
  2. 55 services audited for transaction scope; external calls moved outside the transactional block.
  3. HikariCP pool sizing rebalanced across the three services based on actual concurrency profiles (gateway 300, auth 150, worker 50; 500 in total against a 2,730-connection maximum, about 18% of the limit, which leaves roughly 82% headroom).
  4. Connection leak detection turned on in Hikari config (leakDetectionThreshold) so future leaks log a stack trace immediately.
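
To illustrate what item 4 buys, here is a minimal sketch of the idea behind a leak-detection threshold: capture a stack trace when a connection is borrowed, schedule a report for later, and cancel the report if the connection comes back in time. This is a stand-alone model written for this article, not HikariCP's implementation; LeakDetectionSketch, Lease, and reportsFor are hypothetical names, and the thresholds are far shorter than anything a real pool would use.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone model of threshold-based leak detection.
class LeakDetectionSketch implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long thresholdMs;
    // Each entry carries the stack trace of the code that borrowed the connection.
    final List<Throwable> reports = new CopyOnWriteArrayList<>();

    LeakDetectionSketch(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /** Handle given to the borrower; closing it cancels the pending leak report. */
    final class Lease implements AutoCloseable {
        private final ScheduledFuture<?> pendingReport;

        Lease(ScheduledFuture<?> pendingReport) {
            this.pendingReport = pendingReport;
        }

        @Override
        public void close() {
            pendingReport.cancel(false);
        }
    }

    Lease borrow() {
        // Capture the stack now, so a later report points at the borrowing code.
        Throwable borrowedAt =
                new Throwable("connection held longer than " + thresholdMs + " ms");
        ScheduledFuture<?> pending = scheduler.schedule(
                () -> { reports.add(borrowedAt); }, thresholdMs, TimeUnit.MILLISECONDS);
        return new Lease(pending);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    /** Borrows once, holds for holdMs, and returns how many leak reports were raised. */
    static int reportsFor(long thresholdMs, long holdMs) {
        try (LeakDetectionSketch detector = new LeakDetectionSketch(thresholdMs)) {
            try (Lease lease = detector.borrow()) {
                Thread.sleep(holdMs);
            }
            return detector.reports.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while holding the lease", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("held past threshold: " + reportsFor(50, 300));
        System.out.println("returned in time:    " + reportsFor(500, 10));
    }
}
```

In a Spring Boot application, items 1 and 4 are usually expressed as the properties spring.jpa.open-in-view=false and spring.datasource.hikari.leak-detection-threshold (in milliseconds). The values used in this incident are not stated here, so none are assumed.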

Result: connection leaks dropped from 137/h to 21/h (-85%) within 48 hours of deploy.
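
The figures quoted in the fix list and in the result line can be checked with a few lines of arithmetic. The helper names below are hypothetical; the inputs are the numbers stated in this write-up.

```java
import java.util.Locale;

class IncidentArithmetic {
    /** Share of `whole` taken by `part`, in percent. */
    static double percentOf(int part, int whole) {
        return 100.0 * part / whole;
    }

    /** Relative change from `before` to `after`, in percent (negative means a drop). */
    static double percentChange(double before, double after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        int totalPool = 300 + 150 + 50; // the three pool sizes listed in the fix
        int maxConnections = 2730;      // connection maximum quoted in the fix
        System.out.println(String.format(Locale.ROOT,
                "pool budget: %d of %d (%.1f%%)",
                totalPool, maxConnections, percentOf(totalPool, maxConnections)));
        // prints: pool budget: 500 of 2730 (18.3%)
        System.out.println(String.format(Locale.ROOT,
                "leak rate change: %.1f%%", percentChange(137, 21)));
        // prints: leak rate change: -84.7%
    }
}
```

The pools together can claim about 18% of the connection limit, and the drop from 137/h to 21/h rounds to the -85% quoted above.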

Observability improvements

To prevent recurrence and surface similar patterns before they reach production, I added:

  • CloudWatch metric filters on log lines matching "HikariPool.*timed out", "Connection is not available", and "pool stats:", with an alarm configured at a rate above 5/min.
  • CloudWatch Insights queries saved as named runbook entries (pool-saturation, connection-leaks, slow-transactions).
  • 15+ new CloudWatch Alarms organized by category: memory pressure, connection saturation, task health, RDS replication lag — each with a one-line runbook attached to the alarm description.
  • Container Insights enabled on ECS Fargate for real-time CPU, memory and connection visibility per task.
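
The filter-plus-threshold logic from the first bullet can be prototyped locally before it is wired into CloudWatch. The sketch below is an assumption-laden stand-in: it uses a Java regex rather than CloudWatch's own filter pattern syntax, buckets matches by calendar minute, and flags any minute with more than 5 matches. PoolAlarmSketch, LogEvent, and the sample log lines used to exercise it are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical local prototype of "match pool errors, alarm above 5 per minute".
class PoolAlarmSketch {
    // The three patterns named in the list above, joined into one Java regex.
    static final Pattern POOL_PATTERN = Pattern.compile(
            "HikariPool.*timed out|Connection is not available|pool stats:");
    static final int MAX_MATCHES_PER_MINUTE = 5;

    /** One log line with its timestamp in seconds since the epoch. */
    static final class LogEvent {
        final long epochSeconds;
        final String message;

        LogEvent(long epochSeconds, String message) {
            this.epochSeconds = epochSeconds;
            this.message = message;
        }
    }

    /** Counts matching lines per minute bucket (epoch seconds divided by 60). */
    static Map<Long, Integer> matchesPerMinute(List<LogEvent> events) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (LogEvent event : events) {
            if (POOL_PATTERN.matcher(event.message).find()) {
                counts.merge(event.epochSeconds / 60, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** True if any single minute contains more matches than the threshold. */
    static boolean shouldAlarm(List<LogEvent> events) {
        for (int count : matchesPerMinute(events).values()) {
            if (count > MAX_MATCHES_PER_MINUTE) {
                return true;
            }
        }
        return false;
    }
}
```

Feeding it six matching lines within the same minute trips the alarm, while five do not, which mirrors the "rate > 5/min" rule. A fixed per-minute bucket is the simplest choice for a prototype; CloudWatch evaluates alarms over its own configured periods, so the behavior at bucket boundaries may differ.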

When a similar pattern appeared in sandbox five days later during the OSIV rollout on a different service, the alarm fired in under 2 minutes and the issue was caught before it reached production traffic.

Outcome

The incident was resolved and the patterns that caused it were addressed at the application and observability layers. Pool-level visibility now exists where it didn’t before, with alarms and runbooks in place for the categories of failure observed during the incident.