Cross-border logistics · 2025

Designing a recurring audit framework for AWS infrastructure

Design of a recurring review system covering cost, security, reliability, database and hygiene across an AWS footprint, with structured execution and evidence-based reporting.

Role: Senior DevOps Engineer & AWS Solutions Architect
Audit pillars: 5 (cost, security, reliability, DB, hygiene)
Cadence: On-demand + scheduled
Evidence model: API-backed findings
AWS · AWS CLI · Bash · Python · CloudWatch · Claude Code

The problem

A continuously evolving AWS footprint generates recurring operational questions that benefit from periodic review:

  • Are services right-sized after recent load changes?
  • Is any IAM policy more permissive than required?
  • Are backups being taken and tested as expected?
  • Is any data stored unencrypted unintentionally?
  • Are there Lambda functions outside the VPC that shouldn’t be?
  • Are CloudWatch log groups accumulating storage without retention policies?

Each of these can be answered through the AWS console, but doing it consistently across multiple environments requires sustained effort. The questions tend to get deprioritized until an issue surfaces.

The framework

I designed the framework around five audit pillars, drawing on the AWS Well-Architected Framework. Each pillar has a defined scope and verification approach:

  • Cost — EC2, ECS/Fargate, RDS, Lambda, S3, networking and storage. Cross-referenced with CloudWatch utilization to identify right-sizing opportunities.
  • Security — IAM, RDS, S3, Lambda, networking, encryption and security service posture.
  • Database — RDS CloudWatch metrics in the remote phase, with optional detailed review (SSH + MySQL) for slow queries, deadlocks and problematic events.
  • Reliability — backups, alarms, service stability, ENI limits, health checks and failover configuration.
  • Hygiene — orphaned resources, misconfigurations, missing lifecycle policies and unnecessary storage accumulation.
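
As an illustration of what a single hygiene check could look like, here is a minimal sketch that flags CloudWatch log groups with no retention policy. It assumes the input is the parsed JSON of `aws logs describe-log-groups`, where `retentionInDays` is absent for groups that never expire. The function name, the sample data and the `/hypothetical/...` log group names are illustrative, not taken from the framework's actual code.

```python
from typing import Any


def log_groups_without_retention(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return log groups that never expire, largest first.

    `response` is the parsed JSON of `aws logs describe-log-groups`.
    CloudWatch Logs omits `retentionInDays` when no retention is set.
    """
    unbounded = [
        {
            "arn": group.get("arn", ""),
            "name": group["logGroupName"],
            "stored_bytes": group.get("storedBytes", 0),
        }
        for group in response.get("logGroups", [])
        if "retentionInDays" not in group
    ]
    return sorted(unbounded, key=lambda g: g["stored_bytes"], reverse=True)


# Hypothetical sample shaped like the API response (not real data).
sample_response = {
    "logGroups": [
        {
            "logGroupName": "/hypothetical/api",
            "arn": "arn:aws:logs:eu-west-1:123456789012:log-group:/hypothetical/api:*",
            "storedBytes": 1_000,
            "retentionInDays": 30,
        },
        {
            "logGroupName": "/hypothetical/worker",
            "arn": "arn:aws:logs:eu-west-1:123456789012:log-group:/hypothetical/worker:*",
            "storedBytes": 5_000,
        },
        {
            "logGroupName": "/hypothetical/cron",
            "arn": "arn:aws:logs:eu-west-1:123456789012:log-group:/hypothetical/cron:*",
            "storedBytes": 9_000,
        },
    ]
}

flagged = log_groups_without_retention(sample_response)
```

Keeping the check a pure function over the API response means the same response that produced the finding can be stored next to it as evidence.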

Every finding must be backed by a real API call as evidence: no finding enters the report without a verifiable source. The rule applies regardless of which tool executes the check.
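
One way to make this rule hard to bypass is to encode it in the data model, so that a finding without evidence cannot be constructed at all. The sketch below assumes a Python representation; the class and field names (`Evidence`, `Finding`, `command`, `output`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    command: str  # the exact read-only call that was run
    output: str   # raw output captured from that call


@dataclass(frozen=True)
class Finding:
    resource_arn: str
    summary: str
    evidence: Evidence

    def __post_init__(self) -> None:
        # The contract: no verifiable source, no finding.
        if not self.evidence.command.strip() or not self.evidence.output.strip():
            raise ValueError("finding rejected: no verifiable API call attached")


# Hypothetical example values (placeholder account ID and log group).
accepted = Finding(
    resource_arn="arn:aws:logs:eu-west-1:123456789012:log-group:/hypothetical/app:*",
    summary="Log group has no retention policy",
    evidence=Evidence(
        command="aws logs describe-log-groups --log-group-name-prefix /hypothetical/app",
        output='{"logGroups": [{"logGroupName": "/hypothetical/app"}]}',
    ),
)
```

Because the check lives in the constructor, it holds whether the finding comes from a Bash wrapper, a Python script or an AI-assisted run.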

Implementation

The framework is executed through a combination of the AWS CLI, Bash and Python scripts, and a set of Claude Code commands that automate the data collection across the defined scope. The decision to use AI-assisted execution was deliberate: it removes the friction of running the checks manually each time, while keeping the contract (evidence-based findings) enforced by the framework itself, not by the tool.
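
A minimal sketch of a collection step that keeps the executed command together with its output, so the evidence travels with the data. It assumes the AWS CLI is invoked with JSON output. The `collect` helper and the `fake_runner` stand-in are hypothetical and exist only so the example runs without credentials.

```python
import json
import shlex
import subprocess
from typing import Any, Callable, Sequence


def collect(argv: Sequence[str], runner: Callable[..., Any] = subprocess.run) -> dict[str, Any]:
    """Run one CLI command and record the command itself alongside its parsed output."""
    completed = runner(list(argv), capture_output=True, text=True, check=True)
    return {
        "command": shlex.join(argv),           # reproducible verification step
        "data": json.loads(completed.stdout),  # parsed JSON response
    }


def fake_runner(argv, **kwargs):
    """Stand-in for subprocess.run so the sketch runs without AWS access."""
    return subprocess.CompletedProcess(argv, 0, stdout='{"logGroups": []}', stderr="")


record = collect(
    ["aws", "logs", "describe-log-groups", "--output", "json"],
    runner=fake_runner,
)
```

In a real run the default `subprocess.run` would be used. Injecting the runner keeps the collection logic testable offline.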

The judgment loop stays human throughout:

  • I design and update the framework, including which checks belong in which pillar and how severity is assigned.
  • The execution layer (whichever tool) only collects and reports against the framework’s rules.
  • I review every report, prioritize based on context, and decide what gets implemented.

[Figure: audit pipeline. 01 · Trigger: on-demand or scheduled /aws-audit, handled by an orchestrator agent. 02 · Specialized audit agents (read-only IAM): cost (EC2, ECS, RDS, S3, Lambda), security (IAM, encryption, network, public access), database (RDS metrics, slow queries, deadlocks), reliability (backups, alarms, ENI, failover), hygiene (orphans, lifecycle, log retention). Contract: verify every finding against a real API call before reporting. 03 · Synthesis (~20 min): Markdown audit report grouped by severity (critical, high, medium, low); each finding carries the resource ARN, current state, verifying API call and applying API call.]

How an audit run works

A run uses read-only IAM access and produces a Markdown report grouped by severity:

  • Critical — requires immediate attention
  • High — should be addressed within the week
  • Medium — backlog item
  • Low — improvement opportunity
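
The grouping step can be sketched as follows, assuming each finding is a dictionary with `severity` and `title` keys. Those key names, the heading format and the sample findings are illustrative assumptions.

```python
SEVERITY_ORDER = ("critical", "high", "medium", "low")


def group_by_severity(findings):
    """Bucket findings by severity, preserving the report's fixed order."""
    grouped = {level: [] for level in SEVERITY_ORDER}
    for finding in findings:
        level = finding["severity"].lower()
        if level not in grouped:
            raise ValueError(f"unknown severity: {finding['severity']!r}")
        grouped[level].append(finding)
    return grouped


def render_sections(findings):
    """Render non-empty severity sections as Markdown."""
    lines = []
    for level, items in group_by_severity(findings).items():
        if not items:
            continue
        lines.append(f"## {level.capitalize()} ({len(items)})")
        lines.extend(f"- {item['title']}" for item in items)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical findings, deliberately out of order.
report = render_sections([
    {"severity": "Low", "title": "Unused Elastic IP"},
    {"severity": "critical", "title": "Publicly readable S3 bucket"},
    {"severity": "low", "title": "Log group without retention"},
])
```

Rejecting unknown severities keeps the report limited to the four levels the framework defines.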

Each finding includes:

  • Exact AWS resource ARN
  • Current state, with the API call to verify it
  • Recommended fix, with the API call to apply it
  • Brief explanation of why it matters
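
The four fields above map naturally onto a small record plus a Markdown renderer. This is a sketch under assumed names (`ReportedFinding`, `to_markdown`). The example uses a placeholder account ID and a hypothetical log group, and the 30-day retention value is only an example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedFinding:
    resource_arn: str
    current_state: str
    verify_command: str  # read-only call that reproduces the observation
    fix_command: str     # call an operator may choose to run manually
    why_it_matters: str


def to_markdown(f: ReportedFinding) -> str:
    """Render one finding as a Markdown entry."""
    return "\n".join([
        f"### `{f.resource_arn}`",
        "",
        f"- Current state: {f.current_state}",
        f"- Verify: `{f.verify_command}`",
        f"- Fix (manual): `{f.fix_command}`",
        f"- Why it matters: {f.why_it_matters}",
    ])


example = ReportedFinding(
    resource_arn="arn:aws:logs:eu-west-1:123456789012:log-group:/hypothetical/app:*",
    current_state="no retention policy, logs never expire",
    verify_command="aws logs describe-log-groups --log-group-name-prefix /hypothetical/app",
    fix_command="aws logs put-retention-policy --log-group-name /hypothetical/app --retention-in-days 30",
    why_it_matters="storage cost grows without bound",
)
entry = to_markdown(example)
```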

Operating constraints

  • Execution operates with read-only IAM permissions. Any apply step is an explicit manual action.
  • Every finding references a real API call as evidence. This is a framework requirement, not a tool-level option.
  • All output is reviewed manually before action. Automated execution speeds up data collection; prioritization and implementation are decided by the operator.
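
As a defense-in-depth illustration, a collector could also refuse to run anything that does not look like a read operation. The sketch below is a conservative heuristic based on AWS CLI operation-name prefixes. It is an assumption about how such a guard might look, not a substitute for the read-only IAM permissions, which remain the actual enforcement.

```python
from typing import Sequence

# Operation-name prefixes treated as read-only in this sketch (heuristic).
READ_ONLY_PREFIXES = ("describe-", "list-", "get-")


def is_read_only(argv: Sequence[str]) -> bool:
    """Accept only `aws <service> <operation> ...` with a read-style operation.

    Deliberately conservative: anything it does not recognize is rejected,
    including commands with global options placed before the service name.
    """
    if len(argv) < 3 or argv[0] != "aws":
        return False
    return argv[2].startswith(READ_ONLY_PREFIXES)


allowed = is_read_only(["aws", "logs", "describe-log-groups"])
blocked = is_read_only(
    ["aws", "logs", "put-retention-policy", "--log-group-name", "/hypothetical/app"]
)
```

Erring on the side of rejection means a false negative costs a skipped check, while a false positive would be stopped by IAM anyway.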

Outcome

The audit cycle that was previously inconsistent now runs reliably across all environments. Findings are documented with reproducible verification steps, which makes the review process auditable and the fixes traceable. The framework defines what gets checked, how severity is assigned and what evidence is required. Execution speed is a benefit of the implementation; the integrity of the audit comes from the framework’s structure.