A 36-item interactive checklist for FinOps and platform teams running mid-market Azure tenants.
Most mid-market Azure environments at $10k–$80k/mo of cloud spend can take 20–35% out of the bill in a single 60-day push, without rewriting a single application. The waste is spread across over-provisioned VMs, under-reserved baseline compute, untiered blob storage, NAT and cross-region egress, and Log Analytics retention defaults that nobody set on purpose.
Before you optimize anything, know where the bill comes from. Typical mid-market Azure tenants we audit break down roughly like this — your mix will vary by workload, but the ranking rarely does:
| Cost category | Share of bill | Optimization difficulty |
| --- | --- | --- |
| VMs, VMSS, AKS node pools | 45–60% | Medium: rightsize + reserve |
| Managed databases (SQL DB, PostgreSQL, Cosmos DB) | 10–20% | Medium: tier + serverless |
| Blob and managed-disk storage | 8–15% | Easy: lifecycle + tier |
| Networking (NAT, peering, egress, App Gateway) | 6–14% | Hard: architectural |
| Log Analytics, App Insights, Sentinel | 4–10% | Easy: retention + filtering |
| App Service, Functions, Container Apps | 3–8% | Easy: plan tier + scale |
The pattern: compute and managed databases are the bulk of the bill but they're medium-difficulty to optimize — they need rightsizing data and reservation analysis. Storage, observability, and App Service are smaller but easy wins. Most teams should attack the easy categories first to fund the work, then tackle compute.
Case study: $52k/mo Azure tenant, B2B SaaS company
Anonymized engagement — a Series-B SaaS company running three products on Azure across two regions. AKS for application services, Azure SQL for the primary OLTP database, Cosmos DB for one product's session store, blob storage for customer file uploads, and a Sentinel + Log Analytics setup their security team had configured 18 months prior and never revisited.
| Line item | Before | After | Saved |
| --- | --- | --- | --- |
| AKS node pools (D8s v5, mostly idle nights/weekends) | $18,400/mo | $10,200/mo | $8,200 |
| Azure SQL DB (Business Critical, low DTU usage) | $9,600/mo | $5,400/mo | $4,200 |
| Cosmos DB (provisioned RU, peak-sized) | $5,200/mo | $2,800/mo | $2,400 |
| Blob storage (no lifecycle, all Hot tier) | $4,100/mo | $1,650/mo | $2,450 |
| NAT Gateway + cross-region egress | $3,800/mo | $2,100/mo | $1,700 |
| Log Analytics + Sentinel (verbose ingestion) | $4,200/mo | $1,400/mo | $2,800 |
| App Service, misc | $6,700/mo | $5,200/mo | $1,500 |
| Monthly total | $52,000 | $28,750 | $23,250 |
45% reduction over 60 days. The biggest movers: switching AKS to scheduled-scaling with mixed reservation + Savings Plan coverage, dropping Azure SQL from Business Critical to General Purpose with the right Storage Performance tier, switching Cosmos DB to autoscale RU/s, and pulling Sentinel data ingestion under control with table-level retention. No customer-facing changes.
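As a quick sanity check on the case-study figures, the sketch below re-derives the totals and ranks each line item by its share of the savings. The dollar amounts are copied from the case study; the shortened item names and the `summarize` helper are illustrative only.

```python
# (line item, before $/mo, after $/mo) copied from the case-study figures
line_items = [
    ("AKS node pools", 18_400, 10_200),
    ("Azure SQL DB", 9_600, 5_400),
    ("Cosmos DB", 5_200, 2_800),
    ("Blob storage", 4_100, 1_650),
    ("NAT Gateway + cross-region egress", 3_800, 2_100),
    ("Log Analytics + Sentinel", 4_200, 1_400),
    ("App Service, misc", 6_700, 5_200),
]


def summarize(items):
    """Return (total_before, total_after, total_saved, percent_saved)."""
    before = sum(b for _, b, _ in items)
    after = sum(a for _, _, a in items)
    saved = before - after
    return before, after, saved, 100 * saved / before


before, after, saved, pct = summarize(line_items)
print(f"${before:,} -> ${after:,} (saved ${saved:,}, {pct:.1f}%)")

# Rank line items by absolute savings, largest first
for name, b, a in sorted(line_items, key=lambda r: r[1] - r[2], reverse=True):
    print(f"{name:<36} ${b - a:>6,}  {100 * (b - a) / saved:4.1f}% of savings")
```

The same few lines work on any before/after export and make it obvious which categories are carrying the reduction.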
1) VM and AKS rightsizing (typical savings: 20–35% of compute)
Open Azure Advisor → Cost and review every "Right-size or shutdown underutilized virtual machines" recommendation. Filter by impact and start with the largest.
Pull 14-day p95 CPU and memory per VM via Azure Monitor. Any VM where p95 CPU stays under 30% is a downsize candidate.
Benchmark newer VM generations (Dasv6, Easv6, Dpsv6 ARM) before re-buying. Per-dollar performance for general-purpose workloads typically improves 15–25% generation-to-generation.
Audit AKS node pools: confirm cluster autoscaler is enabled with sensible min/max bounds, and that user node pools downscale on idle (system pools usually shouldn't).
Use Spot node pools in AKS for stateless, interruption-tolerant workloads (CI runners, batch jobs, background processors). Spot pricing can run up to 90% below pay-as-you-go, with the discount varying by SKU and region, and Spot nodes can be evicted at short notice.
Schedule auto-shutdown on every non-production VM via Azure DevTest Labs or a tag-based automation runbook. Static dev/staging VMs running 24/7 are the single most common waste pattern.
Audit VMSS instances: confirm scale-in policies match scale-out triggers — many VMSS deployments scale up readily but never scale down because the metric and threshold weren't paired.
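The p95 check in this section can be sketched in a few lines once the CPU samples are exported. This is a minimal illustration, not an Azure Monitor client: the VM names and sample values are hypothetical, the percentile uses the simple nearest-rank method, and only CPU is covered (memory would follow the same pattern).

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, giving a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def downsize_candidates(cpu_by_vm, threshold=30.0):
    """Return {vm_name: p95_cpu} for VMs whose p95 CPU is under threshold."""
    result = {}
    for vm, samples in cpu_by_vm.items():
        value = p95(samples)
        if value < threshold:
            result[vm] = value
    return result


# Hypothetical CPU % samples; in practice, 14 days of exported metric data
cpu_by_vm = {
    "vm-api-01": [12, 15, 11, 22, 18, 14, 27, 13, 16, 19],
    "vm-batch-01": [35, 80, 92, 60, 75, 88, 40, 95, 70, 85],
}
print(downsize_candidates(cpu_by_vm))
```

Using p95 rather than the average keeps short bursts in view, so a VM that idles most of the day but spikes under load is not flagged by mistake.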
2) Managed databases and PaaS data tier (typical savings: 25–45% of database spend)
Review Azure SQL Database DTU/vCore utilization over 14 days. Business Critical tier is often overspecified — General Purpose with the right Storage Performance setting matches most OLTP workloads.
Evaluate Azure SQL Serverless for dev/test or low-utilization databases. Auto-pause cuts compute cost to zero during idle periods.
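For Cosmos DB, switch from provisioned to autoscale RU/s on workloads with variable throughput. Autoscale bills each hour for the highest RU/s the container scaled to (never less than 10% of the configured maximum), so you stop paying for always-provisioned peak. The per-RU rate is higher than manual throughput, so workloads that sit near peak most of the day are better left provisioned.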
Audit Cosmos DB indexing policy: default policy indexes every property. Excluding paths you never query reduces RU consumption 20–60%.
Right-size Azure Database for PostgreSQL/MySQL via the Memory Optimized to General Purpose transition where workload allows. Reserved capacity adds another 30–55% off the baseline.
Move old backups and snapshots off premium-priced backup storage to LRS where compliance permits.
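To judge whether Cosmos DB autoscale is worth it for a given workload, a rough cost model helps. The sketch below assumes autoscale is billed per hour on the highest RU/s reached, with a floor of 10% of the configured maximum, at 1.5x the manual rate. Both rates are assumed list prices for illustration and should be replaced with the prices for your region and account type.

```python
STANDARD_RATE = 0.008   # assumed $ per 100 RU/s per hour, manual throughput
AUTOSCALE_RATE = 0.012  # assumed $ per 100 RU/s per hour, autoscale (1.5x)


def provisioned_cost(ru_per_sec, hours):
    """Cost of a fixed RU/s setting held for the given number of hours."""
    return ru_per_sec / 100 * STANDARD_RATE * hours


def autoscale_cost(hourly_peak_ru, max_ru):
    """Cost of autoscale given the peak RU/s observed in each hour."""
    floor = max_ru * 0.10  # autoscale never bills below 10% of the maximum
    total = 0.0
    for peak in hourly_peak_ru:
        billed = min(max(peak, floor), max_ru)
        total += billed / 100 * AUTOSCALE_RATE
    return total


# Hypothetical day: 4 busy hours at 20,000 RU/s, 20 quiet hours at 1,500 RU/s
day = [20_000] * 4 + [1_500] * 20
manual = provisioned_cost(20_000, hours=24)
auto = autoscale_cost(day, max_ru=20_000)
print(f"manual ${manual:.2f}/day, autoscale ${auto:.2f}/day")
```

Under these assumptions, a container that runs at its maximum for more than about two thirds of its hours costs more on autoscale than on manual throughput, which is why the switch only pays off for genuinely variable workloads.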
3) Reservations and Azure Savings Plans (typical savings: 30–55% on baseline compute)
Identify your true stable baseline: the compute that runs 24/7 even after all the rightsizing in section 1. Reserve only this baseline.
Default to Azure Compute Savings Plan unless your VM SKU is firmly locked (e.g., a specific GPU family). The flexibility usually outweighs the 3–5% discount difference vs reserved instances (RIs).
Choose 1-year terms unless you have >18 months of stable workload data justifying 3-year. The default should be 1-year.
Layer reservations across services: SQL DB RIs, App Service RIs, Cosmos DB RIs, Synapse RIs are separate from VM reservations and often forgotten.
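Set a monthly reservation utilization review: if any reservation drops below 95% utilization for two consecutive months, that's a signal to exchange it or request a refund, within the limits of Azure's reservation exchange and refund policy. Azure has no resale marketplace for reservations, and Savings Plans cannot be exchanged or cancelled once purchased.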
Verify Azure Hybrid Benefit is applied wherever eligible: Windows Server VMs, SQL Server VMs, and Windows nodes in AKS. If you already own eligible licenses with Software Assurance, this is a 30–55% discount at no extra cost, and it is silently missed in 1 of 3 tenants we audit.
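The monthly utilization review can be reduced to a small rule: flag any reservation whose last two monthly utilization figures are both under 95%. The sketch below assumes utilization has already been exported as a list of monthly percentages per reservation; the reservation names and numbers are hypothetical.

```python
def reservations_to_review(utilization_by_reservation, threshold=95.0, months=2):
    """Return reservations under `threshold` for the last `months` consecutive months."""
    flagged = []
    for name, history in utilization_by_reservation.items():
        recent = history[-months:]
        # Require a full window so a brand-new reservation is not flagged early
        if len(recent) == months and all(u < threshold for u in recent):
            flagged.append(name)
    return flagged


# Hypothetical monthly utilization percentages, oldest first
history = {
    "ri-d8s-v5-eastus": [99.0, 97.5, 93.0, 91.2],
    "ri-sql-gp-westeu": [98.0, 94.0, 96.5, 99.0],
    "ri-appsvc-p1v3": [90.0],
}
print(reservations_to_review(history))
```

Requiring two consecutive months filters out one-off dips, such as a maintenance window, before anyone spends time on an exchange.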
4) Storage lifecycle and disk hygiene (typical savings: 40–60% of storage spend)
Enable Blob lifecycle management on every container with predictable access patterns: Hot → Cool after 30 days, Cool → Archive after 90 days, delete after retention period.
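For containers with unpredictable access, evaluate lifecycle rules based on last-access-time tracking, which move blobs to cooler tiers automatically once they stop being read. This is cheaper than always-Hot for any container where <30% of objects are accessed monthly.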
Delete orphaned managed disks: detached disks left after VM termination. Use az disk list --query "[?managedBy==null]" to find them.
Audit premium-tier managed disks: Premium SSD v1 is rarely needed. Many workloads run fine on Premium SSD v2 (cheaper, more configurable) or Standard SSD.
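Clean up old snapshots: set a retention policy. On Azure, full managed-disk snapshots are billed on the used size of the source disk, while incremental snapshots are billed only for the changes since the previous snapshot, so prefer incremental snapshots and expire old ones.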
Review Storage Account redundancy: GRS and RA-GRS cost roughly 2x LRS. Use them only where business continuity actually requires geo-redundancy.
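The Hot → Cool → Archive schedule from this section maps onto a Blob lifecycle management policy document. The sketch below builds one as JSON. The rule name, the `uploads/` prefix, and the 730-day delete are hypothetical, and the day counts are measured from last modification, which is one of several conditions the policy schema supports.

```python
import json


def lifecycle_rule(name, prefix, cool_after=30, archive_after=90, delete_after=None):
    """Build one lifecycle rule that tiers block blobs down by age since last modification."""
    actions = {
        "tierToCool": {"daysAfterModificationGreaterThan": cool_after},
        "tierToArchive": {"daysAfterModificationGreaterThan": archive_after},
    }
    if delete_after is not None:
        actions["delete"] = {"daysAfterModificationGreaterThan": delete_after}
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "actions": {"baseBlob": actions},
            # prefixMatch entries start with the container name
            "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
        },
    }


# Hypothetical container and retention period
policy = {"rules": [lifecycle_rule("uploads-tiering", "uploads/", delete_after=730)]}
print(json.dumps(policy, indent=2))
```

Saved to a file, a document like this can be applied with `az storage account management-policy create --account-name <account> --resource-group <rg> --policy @policy.json`. Test on a low-risk container first, since Archive rehydration is slow and early deletion from Cool or Archive carries charges.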
5) Networking, NAT, and egress (typical savings: 15–35% of networking spend)
Audit the top 10 egress contributors by workload and region. NAT Gateway data-processing fees alone are often $0.045/GB — at scale this dwarfs raw transfer costs.
For container workloads pulling images, use Azure Container Registry geo-replicated to the workload region rather than pulling cross-region through NAT.
Place Private Endpoints on chatty PaaS services (Storage, SQL, Cosmos) instead of routing service traffic through NAT or public peering.
Consolidate cross-region traffic: review whether dev/test environments genuinely need cross-region peering, or if same-region would work.
Evaluate Azure Front Door / CDN for static and global traffic. CDN egress is meaningfully cheaper than origin egress at any non-trivial scale.
Right-size Application Gateway and Load Balancer SKUs: the Standard tier is sufficient for most workloads; the WAF tier of Application Gateway (WAF_v2) costs roughly 2x and is only justified when you actually use WAF rules.
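To put numbers on the NAT audit, the sketch below multiplies monthly GB per workload by the $0.045/GB data-processing figure quoted in this section and returns the largest contributors. The workload names and volumes are hypothetical, and the estimate deliberately ignores the hourly NAT Gateway charge and any bandwidth (egress) fees.

```python
NAT_PROCESSING_PER_GB = 0.045  # $/GB, the data-processing figure quoted in this section


def top_nat_contributors(gb_by_workload, n=3, rate=NAT_PROCESSING_PER_GB):
    """Return the n workloads with the highest monthly NAT data-processing cost."""
    costs = {workload: gb * rate for workload, gb in gb_by_workload.items()}
    return sorted(costs.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical monthly GB routed through NAT Gateway, per workload
gb_by_workload = {
    "aks-image-pulls": 24_000,  # e.g. a 2 GB image pulled 400 times a day
    "blob-sync": 9_000,
    "sql-replication": 3_000,
    "telemetry": 800,
}
for workload, cost in top_nat_contributors(gb_by_workload):
    print(f"{workload:<18} ${cost:,.2f}/mo")
```

Ranking by dollars rather than by gigabytes keeps the review focused on the handful of flows worth re-architecting.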
6) FinOps governance and observability (typical savings: 50–70% of Log Analytics + faster catch on anomalies)
Set per-table retention in Log Analytics. A blanket 90-day (or longer) retention across all tables is the most common Azure observability waste pattern. Operational tables: 30 days. Audit/security: per policy.
Audit Sentinel data sources: every connector you enable ingests data. Disable connectors for sources your SOC doesn't actively monitor.
Enforce required cost allocation tags via Azure Policy (e.g., environment, team, product, cost-center). Without tags, chargeback/showback is impossible.
Set budgets and anomaly alerts in Cost Management per subscription, resource group, and product tag. Anomaly alerts catch the slow drift that threshold alerts miss.
Track unit economics: cost per customer, cost per workload, or cost per transaction. Absolute monthly spend hides the customer-cohort metrics that actually matter.
Run a monthly cost review with engineering, platform, and finance stakeholders. The single biggest predictor of sustained Azure savings is whether someone is actually looking at the bill on a cadence.
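Before enforcing tags through Azure Policy, it helps to know how far the tenant is from compliance. The sketch below audits a resource inventory for the example tag set named in this section. It assumes each resource is a dict with `name` and `tags` keys (the shape of `az resource list --output json`), and the sample resources are hypothetical.

```python
REQUIRED_TAGS = ("environment", "team", "product", "cost-center")


def missing_tags(resources, required=REQUIRED_TAGS):
    """Return {resource_name: [missing tag names]} for non-compliant resources."""
    report = {}
    for res in resources:
        # Untagged resources come back with tags set to null/None
        present = {key.lower() for key in (res.get("tags") or {})}
        missing = [tag for tag in required if tag not in present]
        if missing:
            report[res["name"]] = missing
    return report


# Hypothetical inventory
resources = [
    {"name": "aks-prod", "tags": {"environment": "prod", "team": "platform",
                                  "product": "core", "cost-center": "1001"}},
    {"name": "vm-dev-07", "tags": {"Environment": "dev"}},
    {"name": "stlogs01", "tags": None},
]
print(missing_tags(resources))
```

Running the audit first gives teams a list to fix, which makes a later switch to a deny-mode policy far less disruptive.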
Azure-specific tooling worth evaluating
For mid-market Azure tenants past $30k/mo, third-party tooling often pays for itself within 2–4 months by automating what's manual on this checklist. Best fit depends on your existing stack:
Azure-native (free) — Azure Advisor for rightsizing and reservation recommendations, Cost Management for budgets and anomaly detection, Azure Resource Graph for cross-subscription queries. Start here. Most teams get 60–70% of the value before paying for third-party tools.
Multi-cloud FinOps platforms (e.g., Vantage, CloudZero, Apptio Cloudability) — cross-Azure/AWS/GCP visibility, anomaly detection, allocation, and unit economics modeling. Fit if Azure is one workload in a larger cloud footprint.
Reservation automation (e.g., ProsperOps) — automatically tunes RI and Savings Plan mix daily. Strongest fit for teams with steady but evolving workloads where manual reservation management is a chore.
Rightsizing engines (e.g., Densify, Cast.ai for AKS) — automate the p95-driven sizing checks in section 1. Cast.ai specifically can auto-rebin AKS pods onto cheaper Spot nodes.
We don't currently take affiliate commissions on these — if a tool comes up in an audit recommendation, it's because it fits the workload, not because it pays a referral.
Common Azure cost anti-patterns
"Business Critical by default" on Azure SQL. A team chose Business Critical for the primary database during launch when latency mattered, never revisited. Most OLTP workloads run fine on General Purpose with the right Storage Performance setting at half the cost.
Cosmos DB provisioned at peak. RU/s set to peak observed throughput during a launch event 18 months ago. Autoscale RU/s typically cuts this 40–60% with no perf impact for variable workloads.
Sentinel ingestion blanket-enabled. Every connector enabled "just in case," 90-day retention everywhere, no commitment tier set. This single subsystem often runs 5–8% of total Azure spend.
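Premium SSD v1 everywhere. Premium SSD v1 was chosen during early Azure adoption when v2 didn't exist. Migration to Premium SSD v2 or Standard SSD carries no licensing cost and saves 30–50% on most disk workloads, though Premium SSD v2 cannot be used for OS disks and the move needs a planned cutover.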
3-year RIs locked at wrong scale. A 3-year RI bought during peak headcount that no longer matches the active baseline. The reservation cost is sunk — but it perversely incentivizes running idle compute to "use" the commitment.
NAT-pulled container images. AKS clusters pull from Docker Hub or a different-region ACR through NAT Gateway. At scale this single pattern can cost more per month than the cluster's compute. Regionalize ACR.
30-day Azure cost optimization plan
Days 1–3: Measure. Export 14 days of cost data by service, resource group, and tag. Pull Azure Advisor recommendations. Note the top 10 line items and the top 5 Advisor recommendations by impact.
Days 4–7: Quick wins. Enable Blob lifecycle on every major container. Delete orphaned disks and snapshots. Set per-table Log Analytics retention. These are zero-risk and pay for the rest of the work.
Days 8–14: Rightsize. Implement every Advisor recommendation rated "High" impact. Downsize the top quintile of underutilized VMs and SQL databases based on p95 data. Re-benchmark before committing.
Days 15–21: Auto-shutdown and AKS scaling. Schedule non-production VM shutdowns. Tune AKS autoscaler min/max. Move appropriate AKS workloads to Spot node pools.
Days 22–25: Networking. Audit NAT Gateway traffic, regionalize ACR, add Private Endpoints to chatty PaaS services. This is the hardest section — focus on the top 3 egress contributors only.
Days 26–30: Commit the baseline. With the new (lower) stable baseline established, purchase the right Reservation/Savings Plan mix. Validate Azure Hybrid Benefit is fully applied.
Re-measure in 60 days. If the bill hasn't moved 20%, the bottleneck is almost always organizational — either nobody owns the optimization work, or the team owning it doesn't have authority to change workloads. The fix at that point is governance, not technology.