Lab Definition Format
Every Devleep lab is a single YAML file. The file defines the scenario, environment, learning objectives, hint progression, and the automated checks that verify the fix. This page is the complete reference for writing a lab.
Overview
A lab is a YAML file that describes a broken or incomplete environment and the criteria for fixing it. When a student starts a lab, the platform provisions the infrastructure defined by the lab's terraform_module, runs the scenario setup, and connects the student to a live terminal. When the student submits, the platform runs the validation checks defined in the YAML against the live instance.
As a lab author, your job is the YAML file. You describe the scenario, write the validation checks, and provide hints. The platform handles provisioning, terminal access, and grading.
Lab Schema
Fields marked * are required.
Identity & Classification
Timing & Infrastructure
Learning Content
metadata block (optional)
Include this block when submitting — it helps the review team track authorship and version history.
metadata:
  version: 1.0.0
  author: your-name
  reviewed_by: []
  last_updated: 2026-06-14
environment block (optional)
Declare the infrastructure the lab requires. Used for cost estimation and catalog display.
environment:
  instance_type: t3.micro
  estimated_cost: "$0.01"
  aws_services:
    - ec2
deliverables block (optional)
Briefing & Evidence
The briefing: block gives incident and challenge labs their operational context. It replaces the description text with a structured scenario that reads like a real alert. The evidence: block (top-level, separate from briefing) adds flavour cards — PagerDuty alerts, Slack messages, tickets, and notes.
briefing fields
evidence items
briefing:
  severity: sev-2
  impact: Web service returning 500 on all endpoints
  title: Permission Chaos Detected
  narrative: |
    A junior engineer ran sudo chown -R root:root /var/www/html.
    nginx runs as www-data and can no longer read any files.
    Every request returns 500. Fix permissions — no restarts until you
    understand what broke.
evidence:
  - type: pagerduty
    title: "P2: nginx 500 on all requests"
    content: |
      CRITICAL: prod-web-01 — all requests returning 500
      nginx error log: Permission denied reading /var/www/html
  - type: slack
    title: "#ops-alerts"
    content: |
      [automated] nginx 500 rate: 100% — prod-web-01
      First seen: 14:23 UTC
Validation Checks
Checks run against the live EC2 instance when the student clicks Submit. With strategy: all (the default), every check in the array must pass for the lab to be marked complete; with strategy: any, a single passing check is enough. Transient failures are retried up to 3 times automatically.
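The pass rule and retry behaviour described above can be sketched in a few lines. This is only an illustration, not the platform's grading code: the function names are hypothetical, "retried up to 3 times" is read here as three attempts in total, and the sketch retries every failure rather than trying to tell transient failures from real ones.

```python
from typing import Callable, List

MAX_ATTEMPTS = 3  # assumption: "retried up to 3 times" means three attempts in total


def run_with_retries(check: Callable[[], bool], attempts: int = MAX_ATTEMPTS) -> bool:
    """Run one check, trying again after a failure, up to `attempts` times."""
    for _ in range(attempts):
        if check():
            return True
    return False


def lab_passes(checks: List[Callable[[], bool]], strategy: str = "all") -> bool:
    """Combine per-check results according to the validation strategy."""
    results = [run_with_retries(check) for check in checks]
    if strategy == "all":
        return all(results)
    if strategy == "any":
        return any(results)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With one passing and one failing check, lab_passes returns False under "all" and True under "any", which is why "any" suits labs that accept alternative fixes.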
Validation Block Structure
validation:
  strategy: all # all checks must pass (default). use 'any' for alternatives.
  default_timeout_seconds: 5
  checks:
    - id: nginx-running
      name: Nginx Running
      type: output_matches
      cmd: systemctl is-active nginx
      contains: active
      failure_hint: Run 'sudo systemctl restart nginx' and check 'journalctl -xe'
Check Fields
output_matches
Runs cmd via SSH. Passes if the exit code is 0 and the output satisfies the match criteria, such as contains or value. This type covers most validation needs.
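The two-part rule (exit code first, then output) might be modelled roughly as follows. The function is hypothetical, and two details are assumptions rather than documented behaviour: that value means an exact match, and that surrounding whitespace in the output is ignored.

```python
from typing import Optional


def output_matches(exit_code: int, output: str,
                   contains: Optional[str] = None,
                   value: Optional[str] = None) -> bool:
    """Pass only if the command succeeded and its output meets every criterion given."""
    if exit_code != 0:
        return False
    text = output.strip()  # assumption: the trailing newline and padding are ignored
    if contains is not None and contains not in text:
        return False
    if value is not None and text != value:  # assumption: `value` is an exact match
        return False
    return True
```

The exit-code requirement matters for substring matches. systemctl is-active prints "inactive" for a stopped unit, which contains "active", but it also exits non-zero, so a check with contains: active still fails under this model.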
- id: nginx-running
  name: Nginx Running
  type: output_matches
  cmd: systemctl is-active nginx
  contains: active
  failure_hint: Run 'sudo systemctl restart nginx' and check 'journalctl -xe'
- id: disk-below-90
  name: Disk Below 90%
  type: output_matches
  # Block scalar: the command contains ": ", which an unquoted YAML scalar cannot hold.
  cmd: >-
    df / | awk 'NR==2 {gsub("%",""); print ($5 < 90) ? "ok" : "full"}'
  value: ok
  failure_hint: Run 'du -sh /* 2>/dev/null | sort -h' to find what is using space
ssh
Runs cmd via SSH. Passes if exit code is 0. Output is not checked — use for existence checks where you only care whether the command succeeds.
- id: config-exists
  name: Config File Exists
  type: ssh
  cmd: test -f /etc/myapp/config.yaml
  failure_hint: Create the config file at /etc/myapp/config.yaml
- id: service-unit-present
  name: Unit File Written
  type: ssh
  cmd: test -f /etc/systemd/system/myapp.service
  failure_hint: Write the unit file to /etc/systemd/system/myapp.service
http_get
Makes an HTTP GET to url. Passes if the response is 2xx, or matches expect_status if set.
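As a rough model of that rule, the decision depends only on the status code. The helper below is hypothetical, and it assumes that a set expect_status replaces the 2xx rule rather than adding to it, which is the reading that makes the 401 example meaningful.

```python
from typing import Optional


def http_get_passes(status: int, expect_status: Optional[int] = None) -> bool:
    """Decide a check result from the HTTP status code alone."""
    if expect_status is not None:
        # assumption: when expect_status is set, it replaces the 2xx rule
        return status == expect_status
    return 200 <= status <= 299
```

Under this reading, an admin route that answers 200 fails a check with expect_status: 401, so the check catches missing authentication instead of rewarding it.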
- id: app-responding
  name: App HTTP Responding
  type: http_get
  url: "http://{{EC2_IP}}:8080/health"
  failure_hint: Check the app with 'systemctl status myapp' and 'journalctl -u myapp -n 50'
- id: auth-enforced
  name: Auth Required on Admin Route
  type: http_get
  url: "http://{{EC2_IP}}:8080/admin"
  expect_status: 401
multi_check
Runs a list of sub-checks. With operator: all every sub-check must pass. With operator: any at least one must pass.
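One way to picture the operator semantics is a small recursive evaluator over the parsed check. Everything here is illustrative: evaluate and run_leaf are hypothetical names, the check is assumed to be already loaded into a dict, and whether the platform allows a multi_check inside another multi_check is not stated on this page.

```python
from typing import Any, Callable, Dict


def evaluate(check: Dict[str, Any], run_leaf: Callable[[Dict[str, Any]], bool]) -> bool:
    """Evaluate a check dict; a multi_check combines the results of its sub-checks."""
    if check.get("type") != "multi_check":
        return run_leaf(check)
    # Every sub-check is run; this sketch does not stop at the first decisive result.
    results = [evaluate(sub, run_leaf) for sub in check.get("checks", [])]
    operator = check.get("operator")
    if operator == "all":
        return all(results)
    if operator == "any":
        return any(results)
    raise ValueError(f"unknown operator: {operator!r}")
```

Passing run_leaf as a parameter keeps the combining logic separate from how an individual check is executed, so the same function can be exercised with a stub before any real commands are involved.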
- id: stack-healthy
  name: Full Stack Healthy
  type: multi_check
  operator: all
  checks:
    - id: nginx-active
      name: Nginx Active
      type: output_matches
      cmd: systemctl is-active nginx
      contains: active
    - id: app-active
      name: App Active
      type: output_matches
      cmd: systemctl is-active myapp
      contains: active
Writing Good Checks
- Test the outcome, not the method: systemctl is-active nginx is better than checking if a config file was written.
- Use awk to produce ok / fail, or use grep -c to produce a count.
- Write failure_hint as a command to run, not a description of what is wrong.
- Avoid ps aux | grep myapp: grep matches itself. Use systemctl is-active or pgrep -f.
- Use {{EC2_IP}} in http_get URLs.
Hint System
Every lab requires three hint levels. The levels are progressive — each reveals more than the last. Good hints are the difference between a student who learns and one who rage-quits.
Level 1: Directional
Points the student at the right area without giving commands. What should they be looking at? What questions should they be asking? Shown automatically on lab load by default.
Level 2: Specific
Names the relevant commands and flags. Still expects the student to piece together the fix from the hints.
Level 3: Full walkthrough
Every command, in order, with a one-line explanation of why each step is necessary. A student who is completely stuck should be able to follow this to completion.
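Because all three levels are required, a quick self-check before submitting can catch a missing or empty hint. The helper below is a hypothetical author-side sketch, not a platform tool, and it assumes the lab file has already been parsed into a dict by a YAML parser of your choice.

```python
from typing import Any, Dict, List

REQUIRED_LEVELS = (1, 2, 3)


def missing_hint_levels(lab: Dict[str, Any]) -> List[int]:
    """Return the required hint levels that are absent or have empty text."""
    present = {
        hint.get("level")
        for hint in lab.get("hints", [])
        if str(hint.get("text", "")).strip()
    }
    return [level for level in REQUIRED_LEVELS if level not in present]
```

An empty list means every level has some text. The helper says nothing about whether the hints are actually progressive, which still needs a human read.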
hint_policy: show_level_1_automatically
hints:
  - level: 1
    text: |
      Something swept ownership across a directory it should not have touched.
      Find out which directory is affected and what nginx needs to read it.
  - level: 2
    text: |
      Check what nginx is trying to read:
      sudo journalctl -u nginx --since "5 min ago"
      ls -la /var/www/html
      nginx runs as www-data. Fix ownership:
      sudo chown -R www-data:www-data /var/www/html
      sudo chmod 755 /var/www/html
  - level: 3
    text: |
      # Confirm the problem
      sudo journalctl -u nginx --since "5 min ago" | grep "Permission denied"
      ls -la /var/www/html # should show root:root — that is wrong
      # Fix ownership
      sudo chown -R www-data:www-data /var/www/html
      sudo chmod 755 /var/www/html
      sudo find /var/www/html -type f -exec chmod 644 {} \;
      # Verify nginx can now read the files
      sudo nginx -t
      curl -s -o /dev/null -w "%{http_code}" http://localhost
Terraform Modules
The terraform_module field tells the platform which infrastructure template to provision for your lab. Choose the module that matches your scenario — you do not write or modify Terraform.
labs/linux-ec2 (stable)
Single Ubuntu 22.04 EC2 instance (t3.micro). Connects via Cloudflare Tunnel. Used for all Linux track labs. The scenario_id determines what is broken on the instance.
Template variables: EC2_IP, INSTANCE_ID
labs/docker-ec2 (planned)
EC2 instance with Docker and Docker Compose pre-installed. Broken container configurations are injected via userdata. For Containers track labs.
Template variables: EC2_IP, INSTANCE_ID
labs/kubernetes-eks (planned)
EKS cluster with a pre-configured broken workload. For Kubernetes track labs. Note: EKS provisioning takes 8–12 minutes.
Template variables: CLUSTER_ENDPOINT, CLUSTER_NAME
Full Example
A complete incident-mode lab — all required fields, a briefing, evidence cards, deliverables, and validation checks:
schema_version: 1
metadata:
  version: 1.0.0
  author: your-name
  reviewed_by: []
  last_updated: 2026-06-14
id: linux-disk-full
title: "Disk Full: Production Filesystem Saturated"
description: >
  The root filesystem is at 100%. Every write is failing.
  Find what consumed the disk, clear space, and restore the service.
track: linux
module: linux-storage
sort_order: 80
difficulty: intermediate
engineer_level: L2
mode: incident
tags:
  - linux
  - disk
  - storage
  - troubleshooting
  - incident
estimated_minutes: 45
timeout_minutes: 90
terraform_module: labs/linux-ec2
scenario_id: disk-full
environment:
  instance_type: t3.micro
  estimated_cost: "$0.01"
  aws_services:
    - ec2
prerequisites:
  - linux-filesystem-anatomy
briefing:
  severity: sev-1
  impact: Service down — production impact
  title: Filesystem 100% — write errors
  narrative: |
    The production web server is returning 500 errors on every request.
    The app cannot write logs or temp files. The root filesystem hit 100%.
    Identify what consumed the disk, clear space, and restore the service.
evidence:
  - type: pagerduty
    title: "P1: Filesystem 100% — write errors"
    content: |
      CRITICAL: / at 100% on prod-web-01
      App is throwing ENOSPC on every write
      Services failing: nginx, app, cron
  - type: slack
    title: "#ops-alerts"
    content: |
      [automated] ALERT: prod-web-01 disk usage 100%
      First seen: 14:23 UTC — still firing
objectives:
  - Identify what is consuming disk space
  - Remove the bloat without deleting production data
  - Restore service without a restart
success_criteria:
  - df / shows less than 90% used
  - nginx is active and responding
  - App is responding on port 8080
deliverables:
  - path: /tmp/disk-cleanup.log
    description: |
      A log of what you found and what you removed:
      - Output of du -sh showing the bloat
      - Commands run to free space
      - df -h after cleanup
hint_policy: show_level_1_automatically
hints:
  - level: 1
    text: |
      Something wrote a lot of data somewhere.
      Find out what is large. Find out what is growing.
  - level: 2
    text: |
      df -h — which filesystem is full?
      sudo du -sh /* 2>/dev/null | sort -h — what is largest?
      sudo journalctl --disk-usage — are logs the culprit?
      ls -lhS /tmp /var/log /var/cache — spot large files
  - level: 3
    text: |
      df -h
      sudo du -sh /* 2>/dev/null | sort -h
      sudo journalctl --vacuum-size=200M
      sudo find /var/log -name "*.gz" -delete
      sudo apt-get clean
      df -h
      sudo systemctl restart nginx
validation:
  strategy: all
  default_timeout_seconds: 5
  checks:
    - id: disk-below-90
      name: Disk Below 90%
      type: output_matches
      # Block scalar: the command contains ": ", which an unquoted YAML scalar cannot hold.
      cmd: >-
        df / | awk 'NR==2 {gsub("%",""); print ($5 < 90) ? "ok" : "full"}'
      value: ok
      failure_hint: Run 'du -sh /* 2>/dev/null | sort -h' to find what is using space
    - id: nginx-running
      name: Nginx Running
      type: output_matches
      cmd: systemctl is-active nginx
      contains: active
      failure_hint: Start nginx with 'sudo systemctl start nginx'
    - id: app-responding
      name: App HTTP Responding
      type: http_get
      url: "http://{{EC2_IP}}:8080/health"
      failure_hint: Check app logs with 'journalctl -u myapp -n 50'
completion_message: |
  A full disk silently kills services that try to write.
  You can now isolate the cause, clean safely, and verify the fix.
  Set up disk monitoring so you see this before it hits 100%.
Submit a Lab
Good labs come from real incidents. If you have been through an outage and want to turn it into a lab, here is what to prepare:
Write the YAML
Follow the schema on this page. Use an existing lab as a starting point. Declare schema_version: 1 at the top.
Choose the right module
Pick the terraform_module that matches your scenario (labs/linux-ec2 for most Linux labs). Pick or request a scenario_id for what gets broken on the instance.
Write all three hints
Level 1 directional, Level 2 specific commands, Level 3 full walkthrough. A lab without a complete Level 3 will be sent back.
Write tight checks
Every validation check must test the outcome, not the method. Verify your commands produce consistent output before submitting.
Send your YAML
Submit the file to the Devleep team for review. The team will test the scenario, wire up the infrastructure, and publish the lab.
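For the "Write tight checks" step above, one simple way to confirm that a command produces consistent output is to run it several times and compare the results. This is a hypothetical author-side helper, not part of the platform. The command in the demo is a stand-in so the sketch runs anywhere; on a lab instance you would pass the check's own cmd.

```python
import subprocess
import sys
from typing import List, Set


def distinct_outputs(cmd: List[str], runs: int = 5) -> Set[str]:
    """Run cmd several times and collect the distinct stripped stdout values."""
    seen = set()
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        seen.add(result.stdout.strip())
    return seen


if __name__ == "__main__":
    # Stand-in command; on the instance this might be
    # ["systemctl", "is-active", "nginx"] or another check's cmd.
    outputs = distinct_outputs([sys.executable, "-c", "print('ok')"])
    print("consistent" if len(outputs) == 1 else f"inconsistent: {outputs}")
```

More than one distinct value suggests the command depends on timing or ordering, and the check is likely to fail intermittently for students who have in fact fixed the problem.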