Architecting AWS for a 24/7 cross-border logistics operation

Context

The client is a cross-border trucking company operating between the US and Mexico, with 150 employees, 370 tractor-trucks and 500 trailers in continuous service. The operation runs on a set of internal applications: dispatch, mechanics, gates, drivers, payroll, interchanges, and a public-facing website.

When I joined as Senior DevOps Engineer in early 2025, the cloud footprint had grown organically. Some applications wrote directly against the production database from local development sessions. Secrets were duplicated across multiple locations. SSL was managed independently per app. There was no shared baseline for cost, security or reliability across environments.

My responsibility: take ownership of the AWS architecture end-to-end and standardize the operating model across the fleet.

The architecture

I rebuilt the architecture aligned with the six pillars of the AWS Well-Architected Framework.

Compute & networking

ECS Fargate for all production services. No EC2 instances to manage, autoscaling configured per service.
CloudFront + ACM in front of every public surface, with a centralized certificate strategy across all subdomains.
Route 53 as the single source of truth for DNS, with health-checked failover on critical paths.
Per-environment VPC topology to keep dev, staging and production fully isolated.

Data

RDS PostgreSQL with read replicas for reporting workloads and a dedicated readonly_dev user for safe ad-hoc queries on production.
A separate staging database seeded from production dumps, removing the need to develop against production data.
pgvector for embedding-based features.
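
As a rough illustration of the readonly_dev user mentioned above, the sketch below builds the PostgreSQL statements that would give a role SELECT-only access. Only the role name comes from this write-up; the helper names, the database name app_production and the schema public are assumptions for the example, not the actual setup.

```typescript
// Hypothetical helper: builds PostgreSQL statements for a SELECT-only role.
// Database and schema names are illustrative assumptions.

/** Quote an identifier for PostgreSQL: wrap in double quotes, double any inner quote. */
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

function readOnlyGrants(role: string, database: string, schema: string): string[] {
  const r = quoteIdent(role);
  const d = quoteIdent(database);
  const s = quoteIdent(schema);
  return [
    // Let the role connect and resolve objects inside the schema.
    `GRANT CONNECT ON DATABASE ${d} TO ${r};`,
    `GRANT USAGE ON SCHEMA ${s} TO ${r};`,
    // SELECT on every table that exists today.
    `GRANT SELECT ON ALL TABLES IN SCHEMA ${s} TO ${r};`,
    // SELECT on tables created later by the role that runs this statement.
    `ALTER DEFAULT PRIVILEGES IN SCHEMA ${s} GRANT SELECT ON TABLES TO ${r};`,
  ];
}

const statements = readOnlyGrants("readonly_dev", "app_production", "public");
```

Because no INSERT, UPDATE or DELETE privilege is granted, a session logged in as this role can inspect production data but cannot change it through these grants.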

Delivery

GitHub Actions pipelines per repository (each app is its own repo), with:
- Linting, type-check and test gate
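- Build → push to ECR → deploy via ECS task-definition swap (rolling update with minimumHealthyPercent=100 and maximumPercent=200, so the new task set starts at full capacity before the old one drains, which gives blue-green-like behavior)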
- Automated rollback on health-check failure
Standardized secret management across GitHub Secrets, Dockerfile build args and workflow environment variables.
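
The minimumHealthyPercent and maximumPercent values in the deploy step bound how many tasks may run while a deployment is in progress. The sketch below works through that arithmetic; ECS rounds the lower bound up and the upper bound down to whole tasks. The function name and the desired count of 3 are assumptions for illustration.

```typescript
// Task-count bounds during an ECS rolling deployment.
interface DeployBounds {
  minRunning: number; // tasks that must stay healthy throughout the deploy
  maxRunning: number; // ceiling on old + new tasks running at once
}

function deploymentBounds(
  desiredCount: number,
  minimumHealthyPercent: number,
  maximumPercent: number
): DeployBounds {
  return {
    // Lower bound is rounded up to a whole task.
    minRunning: Math.ceil((desiredCount * minimumHealthyPercent) / 100),
    // Upper bound is rounded down to a whole task.
    maxRunning: Math.floor((desiredCount * maximumPercent) / 100),
  };
}

// Hypothetical service with 3 desired tasks and the settings from the pipeline.
const bounds = deploymentBounds(3, 100, 200);
// bounds.minRunning is 3 and bounds.maxRunning is 6
```

With 100/200, the scheduler can start a full second set of tasks before stopping any old ones, so serving capacity never drops below the desired count during a deploy.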

Observability

CloudWatch Logs with 30-day retention and metric filters on key patterns: connection saturation, 5xx spikes, deploy failures.
CloudWatch Alarms routed to SNS, with category-specific runbooks (memory pressure, connection saturation, external dependencies).

Key problems solved

1. Cost reduction of 50–75% across environments

Initial monthly bills ranged from $700 to $6,000+ across environments, with significant unjustified spend.

Actions that moved the number:

Right-sizing ECS task definitions based on real CloudWatch metrics, not estimates.
RDS storage moved off provisioned IOPS in environments that didn’t need it.
ECR lifecycle policies: protect :latest*, keep 20 historical builds, remove the rest.
CloudWatch Logs retention standardized at 30 days for operational logs, longer only where compliance required it.
Scheduled shutdown of non-production environments on weekends.
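
The ECR retention rule in the list above can be sketched as a pure function. This is not the ECR lifecycle policy JSON itself, only the selection logic it is meant to express, and it assumes that protected images do not count toward the number of builds kept. The type and function names are hypothetical.

```typescript
// Retention sketch: protect tags with a given prefix, keep the N newest
// remaining images, expire everything else.
interface ImageRecord {
  tags: string[];
  pushedAt: number; // push time as epoch milliseconds
}

function imagesToExpire(
  images: ImageRecord[],
  keep: number,
  protectedPrefix: string
): ImageRecord[] {
  const isProtected = (img: ImageRecord) =>
    img.tags.some((t) => t.indexOf(protectedPrefix) === 0);
  return images
    .filter((img) => !isProtected(img)) // protected images are never candidates
    .sort((a, b) => b.pushedAt - a.pushedAt) // newest first
    .slice(keep); // everything past the N newest is expired
}

// Small example with keep = 2 instead of 20 to stay readable.
const sampleImages: ImageRecord[] = [
  { tags: ["latest"], pushedAt: 1 },
  { tags: ["build-1"], pushedAt: 2 },
  { tags: ["build-2"], pushedAt: 3 },
  { tags: ["build-3"], pushedAt: 4 },
  { tags: ["latest-staging"], pushedAt: 0 },
];
const expired = imagesToExpire(sampleImages, 2, "latest");
```

In this example only the image tagged build-1 is selected for removal: both latest-prefixed images are protected, and build-3 and build-2 are the two newest of the rest.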

Final monthly range: $350 to $3,400, with predictable cost curves.

2. Migrating apps off direct production-DB access

Three applications wrote to production directly from local development sessions. This put real data at risk: 28K+ inspections, ~500 operators, years of dispatch records.

The fix was structural:

Created a staging database with the same schema, seeded from production dumps.
Introduced a readonly_dev user with SELECT-only privileges on production for safe queries.
Added a DISABLE_EMAIL flag at the SES service level to prevent accidental emails to customers during dev sessions.
Documented the new local-dev workflow so it became the default for the team.
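
One way to picture the DISABLE_EMAIL flag is as a guard wrapped around whatever function actually sends mail. The sketch below is a simplified, synchronous illustration: the accepted values ("1" and "true"), the type names and the injected send function are assumptions, and the real SES call is deliberately left out.

```typescript
// Guard sketch: skip outbound email when DISABLE_EMAIL is set.
interface EmailMessage {
  to: string;
  subject: string;
  body: string;
}
type SendFn = (msg: EmailMessage) => void;
type Env = { [key: string]: string | undefined };

function isEmailDisabled(env: Env): boolean {
  // Accepted truthy values are an assumption for this sketch.
  const value = (env["DISABLE_EMAIL"] || "").toLowerCase();
  return value === "1" || value === "true";
}

/** Wrap the real sender so every call checks the flag first. */
function guardedSender(env: Env, realSend: SendFn, onSkip: SendFn): SendFn {
  return (msg) => {
    if (isEmailDisabled(env)) {
      onSkip(msg); // e.g. log the message instead of delivering it
      return;
    }
    realSend(msg);
  };
}

// Example: a dev session with the flag on never reaches the real sender.
const sent: string[] = [];
const skipped: string[] = [];
const devSend = guardedSender(
  { DISABLE_EMAIL: "true" },
  (m) => { sent.push(m.to); },
  (m) => { skipped.push(m.to); }
);
devSend({ to: "customer@example.com", subject: "Test", body: "Hello" });
```

Putting the check at the service level, rather than at each call site, means a new feature that sends email inherits the protection without any extra work.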

3. Offline-first for field operations

Several operations happen where connectivity is intermittent (trucks at yards, drivers at remote crossings). I designed an offline-first capture and sync model for the field apps: writes go to local storage, sync queues reconcile when the network returns, and conflicts are resolved server-side with an operator audit trail.
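
The capture-and-sync flow described above can be sketched as follows. It is a simplified, synchronous model, not the production implementation: it assumes each record carries a version number, that the server rejects writes based on a stale version, and that every outcome is written to an audit list. All type and function names are hypothetical, and a real client would send writes over the network asynchronously.

```typescript
interface QueuedWrite {
  id: string;
  recordId: string;
  baseVersion: number; // record version the device saw when it captured the write
  value: string;
  operator: string;
}
interface ServerRecord { version: number; value: string; }
interface AuditEntry {
  writeId: string;
  recordId: string;
  operator: string;
  outcome: "applied" | "conflict";
}
type Store = { [recordId: string]: ServerRecord };

// Server side: apply the write only if it was based on the current version.
// Either way, the outcome is recorded in the audit trail.
function applyWrite(store: Store, audit: AuditEntry[], w: QueuedWrite): boolean {
  const current = store[w.recordId] || { version: 0, value: "" };
  const ok = current.version === w.baseVersion;
  if (ok) {
    store[w.recordId] = { version: current.version + 1, value: w.value };
  }
  audit.push({
    writeId: w.id,
    recordId: w.recordId,
    operator: w.operator,
    outcome: ok ? "applied" : "conflict",
  });
  return true; // acknowledged: the client can drop this write from its queue
}

// Client side: drain the queue in order and stop at the first unacknowledged
// write, so ordering is preserved for the next attempt.
function flush(queue: QueuedWrite[], send: (w: QueuedWrite) => boolean): QueuedWrite[] {
  let i = 0;
  while (i < queue.length && send(queue[i])) i++;
  return queue.slice(i);
}

const store: Store = { "insp-1": { version: 1, value: "pending" } };
const audit: AuditEntry[] = [];
const pending: QueuedWrite[] = [
  { id: "w1", recordId: "insp-1", baseVersion: 1, value: "passed", operator: "op-7" },
  { id: "w2", recordId: "insp-1", baseVersion: 1, value: "failed", operator: "op-9" },
];
const remaining = flush(pending, (w) => applyWrite(store, audit, w));
```

In this run the first write is applied and the second, captured against the same stale version, is logged as a conflict for an operator to review. While the device is offline, send returns false and the queue is left untouched.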

4. Same-origin URL handling across worktrees

Each app’s NEXT_PUBLIC_API_URL was hardcoded to localhost:3000, which broke as soon as two worktrees ran on different ports. I designed a same-origin URL strategy: in development, derive the API base from the incoming request; in production, use the configured URL. I documented it as a per-app playbook so the fix is applied consistently across the fleet.
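
A minimal sketch of that resolution logic, assuming the caller can supply the current request origin (for example from window.location.origin in the browser or from the Host header on the server). The function name, the input shape and the example URLs are hypothetical.

```typescript
interface ApiBaseInput {
  nodeEnv: string;           // e.g. "development" or "production"
  requestOrigin?: string;    // origin the current request was served from
  configuredApiUrl?: string; // value of NEXT_PUBLIC_API_URL
}

function resolveApiBase(input: ApiBaseInput): string {
  // Drop trailing slashes so callers can append "/path" safely.
  const strip = (url: string) => url.replace(/\/+$/, "");

  // Outside production, follow whatever origin the app is actually served from.
  if (input.nodeEnv !== "production" && input.requestOrigin) {
    return strip(input.requestOrigin);
  }
  // In production, or when no origin is known, use the configured URL.
  if (input.configuredApiUrl) {
    return strip(input.configuredApiUrl);
  }
  throw new Error("No API base URL available");
}

// Two worktrees on different ports each resolve to their own origin.
const worktreeA = resolveApiBase({
  nodeEnv: "development",
  requestOrigin: "http://localhost:3000",
  configuredApiUrl: "http://localhost:3000",
});
const worktreeB = resolveApiBase({
  nodeEnv: "development",
  requestOrigin: "http://localhost:3001",
  configuredApiUrl: "http://localhost:3000",
});
```

Here worktreeB resolves to http://localhost:3001 even though the configured value still points at port 3000, which is the failure the hardcoded URL used to cause.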

Outcome

A multi-app AWS footprint operating within defined budgets with predictable cost behavior.
A safe local development workflow where the team can run any app locally without risking production data.
A resilience profile for field operations that handles connectivity loss.
A consistent deployment model across all apps with automated rollback on failure.

Going forward, architectural decisions are evaluated against the Well-Architected Framework pillars and the operating baseline established here.