About
DevLeep exists because most DevOps training teaches you commands but never puts you in a situation where something is actually broken and the clock is ticking.
You can finish a 40-hour Kubernetes course and still freeze the first time a production pod enters a crash loop. That is not your fault — it is the fault of how the training was designed.
Sandboxed environments give you a clean, predictable system. Production gives you a system that has been running for months, has been touched by a dozen people, and is failing in a way the documentation never described.
The gap between "I know the commands" and "I can handle an incident" is experience. DevLeep is how you get that experience without waiting for an incident to find you at 2 AM.
Your infrastructure
Labs run in your own AWS account. We provision a real EC2 instance, real VPC, real disk. When you fix the problem, you are fixing it on actual infrastructure — not a simulator.
Real incident scenarios
Every lab is based on a class of production incident: disk exhaustion, misconfigured nginx, OOM kills, failed deployments, certificate expiry. Each one has a root cause and a correct fix.
Automated validation
When you think you have fixed the incident, you run the validation suite. It executes real checks against your environment — no self-reporting, no honour system.
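To make the idea of a "real check" concrete, here is a minimal sketch of what one automated check for a disk exhaustion lab could look like. It is not DevLeep's actual validation code: the names `CheckResult`, `check_disk_usage`, and `run_checks`, and the 90% threshold, are assumptions chosen for illustration. The point is only that the check measures the machine directly instead of asking the learner whether the fix worked.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    """Outcome of one validation check (hypothetical structure)."""
    name: str
    passed: bool
    detail: str


def check_disk_usage(path: str = "/", max_used_fraction: float = 0.90) -> CheckResult:
    """Pass only if the filesystem holding `path` is below the usage limit.

    The 0.90 default is an illustrative assumption, not a DevLeep rule.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total if usage.total else 1.0
    return CheckResult(
        name="disk_usage",
        passed=used_fraction < max_used_fraction,
        detail=f"{path} is {used_fraction:.0%} full (limit {max_used_fraction:.0%})",
    )


def run_checks(checks: List[Callable[[], CheckResult]]) -> List[CheckResult]:
    """Run every check and collect the results; nothing is self-reported."""
    return [check() for check in checks]


if __name__ == "__main__":
    for result in run_checks([check_disk_usage]):
        status = "PASS" if result.passed else "FAIL"
        print(f"[{status}] {result.name}: {result.detail}")
```

A real suite would presumably combine several checks of this kind per lab, for example confirming that a service is running again as well as that the disk has free space.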
Open lab definitions
The lab definitions — YAML files that describe the scenario, objectives, hints, and validation rules — are open source. Anyone can read them, improve them, or contribute new ones.
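As a rough illustration of the four parts named above, the sketch below models a lab definition as a Python dictionary, standing in for a parsed YAML file, and reports which parts are missing. The key names (`scenario`, `objectives`, `hints`, `validation`) and the example values are assumptions made for this sketch; the actual schema is whatever the public repository and contributing guide define.

```python
from typing import Any, Dict, List

# Hypothetical top-level keys, mirroring the four parts described in the text.
REQUIRED_SECTIONS = ("scenario", "objectives", "hints", "validation")


def missing_sections(lab: Dict[str, Any]) -> List[str]:
    """Return the required sections absent from a parsed lab definition."""
    return [key for key in REQUIRED_SECTIONS if key not in lab]


# Illustrative content only; not taken from a real DevLeep lab.
example_lab = {
    "scenario": "The root filesystem is full and a service has stopped.",
    "objectives": ["Find what is consuming disk space", "Restore the service"],
    "hints": ["Check which directories are largest"],
    "validation": [{"check": "disk_usage", "max_used_fraction": 0.90}],
}

if __name__ == "__main__":
    print(missing_sections(example_lab))
    print(missing_sections({"scenario": "incomplete draft"}))
```

A structural check like this is the sort of thing a contributor could run before proposing a new lab, so that reviewers can focus on the scenario itself.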
Labs are organized into three tracks, each progressing from orientation through intermediate diagnostics to advanced failure modes.
Linux Core
File systems, processes, networking, permissions, systemd, cron, disk management
Containers
Docker internals, failed builds, network isolation, volume mounts, compose failures
Kubernetes
Pod failures, scheduler issues, RBAC, ingress, persistent volumes, cluster degradation
The lab definition format is public. Every scenario, every validation check, every hint is a YAML file in a public repository. We believe incident training gets better when the community can inspect and refine the material.
If you have been through a real outage and want to turn it into a lab, the contributing guide explains the format in full.