Designing a recurring audit framework for AWS infrastructure

The problem

A continuously evolving AWS footprint generates recurring operational questions that benefit from periodic review:

Are services right-sized after recent load changes?
Is any IAM policy more permissive than required?
Are backups being taken and tested as expected?
Is any data stored unencrypted unintentionally?
Are there Lambda functions outside the VPC that shouldn’t be?
Are CloudWatch log groups accumulating storage without retention policies?

Each of these can be answered through the AWS console, but doing it consistently across multiple environments requires sustained effort. The questions tend to get deprioritized until an issue surfaces.

The framework

I designed the framework around five pillars, loosely modeled on the AWS Well-Architected Framework (whose own six pillars differ). Each has a defined scope and verification approach:

Cost — EC2, ECS/Fargate, RDS, Lambda, S3, networking and storage. Cross-referenced with CloudWatch utilization to identify right-sizing opportunities.
Security — IAM, RDS, S3, Lambda, networking, encryption and security service posture.
Database — RDS CloudWatch metrics in the remote phase, with optional detailed review (SSH + MySQL) for slow queries, deadlocks and problematic events.
Reliability — backups, alarms, service stability, ENI limits, health checks and failover configuration.
Hygiene — orphaned resources, misconfigurations, missing lifecycle policies and unnecessary storage accumulation.
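As an illustration of what a single Hygiene check could look like, here is a minimal Python sketch that flags CloudWatch log groups with no retention policy. The function names are hypothetical and not part of the framework described here. It relies on the fact that the DescribeLogGroups response omits retentionInDays for log groups that never expire, and it assumes boto3 is installed and read-only credentials are configured when the fetch helper is used.

```python
from typing import Iterable


def log_groups_without_retention(log_groups: Iterable[dict]) -> list[str]:
    """Return the ARNs of log groups that never expire their events.

    Each item is shaped like an entry of `logGroups` in the
    DescribeLogGroups response. CloudWatch Logs leaves out the
    `retentionInDays` key when no retention policy is set.
    """
    return [g["arn"] for g in log_groups if "retentionInDays" not in g]


def fetch_log_groups(region: str) -> list[dict]:
    """Collect every log group in a region using a read-only API call."""
    import boto3  # imported lazily so the filter above has no dependencies

    client = boto3.client("logs", region_name=region)
    paginator = client.get_paginator("describe_log_groups")
    groups: list[dict] = []
    for page in paginator.paginate():
        groups.extend(page["logGroups"])
    return groups
```

Keeping the filter separate from the API call means the check can be exercised against recorded responses without touching an account.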

The framework requires every finding to be backed by a real API call as evidence. No finding is accepted into the report without a verifiable source. This rule applies regardless of the tool that executes the check.
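One way to make the evidence rule hard to bypass is to enforce it in the data structure itself. The sketch below is an assumption about how that could be done, not the framework's actual code: the Finding class and its field names are hypothetical, and the only behavior shown is that a finding without an evidence call cannot be constructed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one audit finding."""

    resource_arn: str
    summary: str
    evidence_call: str  # the read-only API call that reproduces the observation

    def __post_init__(self) -> None:
        # Reject the finding outright if no evidence call was supplied.
        if not self.evidence_call.strip():
            raise ValueError(
                f"finding for {self.resource_arn} has no evidence call"
            )
```

Because the check lives in the constructor, it holds whether the finding was produced by a Bash script, a Python script or an AI-assisted command.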

Implementation

The framework is executed through a combination of the AWS CLI, Bash and Python scripts, and a set of Claude Code commands that automate the data collection across the defined scope. The decision to use AI-assisted execution was deliberate: it removes the friction of running the checks manually each time, while keeping the contract (evidence-based findings) enforced by the framework itself, not by the tool.

The judgment loop stays human throughout:

I design and update the framework, including which checks belong in which pillar and how severity is assigned.
The execution layer, whichever tool runs it, only collects and reports against the framework’s rules.
I review every report, prioritize based on context, and decide what gets implemented.

How an audit run works

A run uses read-only IAM access and produces a Markdown report grouped by severity:

Critical — requires immediate attention
High — should be addressed within the week
Medium — backlog item
Low — improvement opportunity

Each finding includes:

Exact AWS resource ARN
Current state, with the API call to verify it
Recommended fix, with the API call to apply it
Brief explanation of why it matters
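The report layout described above could be rendered roughly as follows. This is a sketch under assumptions: the ReportFinding class, its field names and the exact Markdown headings are illustrative choices, while the severity order and the four items per finding come from the description above.

```python
from dataclasses import dataclass

# Severity levels in the order they appear in the report.
SEVERITY_ORDER = ("Critical", "High", "Medium", "Low")


@dataclass
class ReportFinding:
    """Hypothetical finding record carrying the four items listed above."""

    severity: str
    resource_arn: str
    current_state: str
    verify_call: str
    fix_call: str
    why: str


def render_report(findings: list[ReportFinding]) -> str:
    """Render findings as Markdown, grouped by severity, most severe first."""
    lines = ["# Audit report"]
    for severity in SEVERITY_ORDER:
        group = [f for f in findings if f.severity == severity]
        if not group:
            continue  # skip empty severity sections
        lines += ["", f"## {severity}"]
        for f in group:
            lines += [
                "",
                f"### {f.resource_arn}",
                f"- Current state: {f.current_state}",
                f"- Verify: `{f.verify_call}`",
                f"- Recommended fix: `{f.fix_call}`",
                f"- Why it matters: {f.why}",
            ]
    return "\n".join(lines) + "\n"
```

The fix call is only printed in the report; nothing in this sketch executes it.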

Operating constraints

Execution operates with read-only IAM permissions. Any apply step is an explicit manual action.
Every finding references a real API call as evidence. This is a framework requirement, not a tool-level option.
All output is reviewed manually before action. Automated execution speeds up data collection; prioritization and implementation are decided by the operator.
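As a second line of defense on top of the read-only IAM permissions, the execution layer could refuse to run any command that does not look read-only. The sketch below is a hypothetical guard based on a naming heuristic: it assumes the command has the form aws, service, operation and only accepts operations starting with describe-, list- or get-. It is deliberately conservative, so it also rejects high-level commands such as aws s3 ls and any call with global options placed before the service name. IAM remains the real enforcement.

```python
# Operation prefixes treated as read-only. This is a naming heuristic,
# not a guarantee; the IAM policy is what actually blocks writes.
READ_ONLY_PREFIXES = ("describe-", "list-", "get-")


def is_read_only(cli_args: list[str]) -> bool:
    """Return True if an `aws <service> <operation> ...` call looks read-only."""
    if len(cli_args) < 3 or cli_args[0] != "aws":
        return False
    operation = cli_args[2]
    return operation.startswith(READ_ONLY_PREFIXES)
```

For example, is_read_only(["aws", "logs", "describe-log-groups"]) is accepted, while a put-retention-policy call is rejected and left for the operator to run manually.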

Outcome

The audit cycle that was previously inconsistent now runs reliably across all environments. Findings are documented with reproducible verification steps, which makes the review process auditable and the fixes traceable. The framework defines what gets checked, how severity is assigned and what evidence is required. Execution speed is a benefit of the implementation; the integrity of the audit comes from the framework’s structure.