Lab Definition Format

Every Devleep lab is a single YAML file. The file defines the scenario, environment, learning objectives, hint progression, and the automated checks that verify the fix. This page is the complete reference for writing a lab.

Overview

A lab is a YAML file that describes a broken or incomplete environment and the criteria for fixing it. When a student starts a lab, the platform provisions the infrastructure defined by the lab's terraform_module, runs the scenario setup, and connects the student to a live terminal. When the student submits, the platform runs the validation checks defined in the YAML against the live instance.

As a lab author, your job is the YAML file. You describe the scenario, write the validation checks, and provide hints. The platform handles provisioning, terminal access, and grading.

schema_version: 1 (every lab must declare this at the top)

terraform_module (the infrastructure template to provision; see Terraform Modules)

validation.checks (the SSH commands and HTTP checks that run against the live instance)

Lab Schema

Fields marked * are required.

Identity & Classification

schema_version* (integer)

Always 1. Must be the first field in the file.

id* (string)

Unique slug used as the URL path. Use kebab-case. Must be unique across all labs.

title* (string)

Human-readable lab title. Keep it under 80 characters.

description* (string)

One or two sentences describing the scenario. Shown on the catalog card and before the lab starts.

track* (string)

Track this lab belongs to: linux · docker · kubernetes · ansible · git · jenkins.

module* (string)

Module within the track, e.g. linux-fundamentals · linux-administration · linux-security.

sort_order* (integer)

Position in the curriculum. Use multiples of 10 (10, 20, 30…) to leave room for future labs.

difficulty* (enum)

beginner · intermediate · advanced · expert

engineer_level* (enum)

L1 · L2 · L3 · L4, corresponding to beginner through expert.

mode* (enum)

guided: step-by-step tasks. incident: broken environment to diagnose and fix. challenge: open-ended, no hints.

tags* (string[])

Keywords for search and filtering. Use 4–6 tags per lab.

Timing & Infrastructure

estimated_minutes* (integer)

Expected time to complete. Shown on the catalog card; does not auto-end the session.

timeout_minutes* (integer)

Hard session timeout. Infrastructure is destroyed after this many minutes regardless of completion.

terraform_module* (string)

Infrastructure template to provision. See the Terraform Modules section for available options.

scenario_id* (string)

Identifies which scenario setup script to run on the instance after provisioning, e.g. broken-nginx · disk-full · hello-linux

prerequisites* (string[])

Lab IDs that should be completed before this one. Use an empty list [] if there are none.
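
As a rough sketch of how the timing and infrastructure fields sit together, the fragment below describes a short beginner lab with no prerequisites. The minute values are illustrative assumptions, not platform recommendations; hello-linux and labs/linux-ec2 are the names used elsewhere on this page.

```yaml
# Illustrative values only: the durations are assumptions.
estimated_minutes: 20      # shown on the catalog card
timeout_minutes: 45        # hard limit; infrastructure is destroyed after this
terraform_module: labs/linux-ec2
scenario_id: hello-linux
prerequisites: []          # empty list when nothing must be completed first
```

Leaving headroom between estimated_minutes and timeout_minutes gives slower students time to finish before the session is torn down.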

Learning Content

objectives* (string[])

3–5 specific learning objectives shown before the lab starts. Be concrete: 'identify the cause of X', not 'learn about X'.

success_criteria* (string[])

Plain-English list of what passing looks like. Mirrors the validation checks in human-readable form.

hints* (Hint[])

Progressive hint levels (1–3). See the Hint System section.

hint_policy* (enum)

show_level_1_automatically: Level 1 is shown on load. manual_only: the student must request every hint.

validation* (Validation)

Automated checks that run against the live instance. See the Validation section.

completion_message (string)

2–3 sentences shown after all checks pass. Reinforce the key lesson.

deliverables (Deliverable[])

Explicit files or outputs the student must produce. Each maps to one or more validation checks.

metadata block (optional)

Include this block when submitting — it helps the review team track authorship and version history.

metadata:
  version: 1.0.0
  author: your-name
  reviewed_by: []
  last_updated: 2026-06-14

environment block (optional)

Declare the infrastructure the lab requires. Used for cost estimation and catalog display.

environment:
  instance_type: t3.micro
  estimated_cost: "$0.01"
  aws_services:
    - ec2

deliverables block (optional)

path* (string)

Absolute path on the instance, e.g. /etc/systemd/system/myapp.service

description* (string)

What the file must contain. Written as a checklist; each item maps to a validation check.
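
A minimal sketch of a deliverable paired with the check that verifies it, assuming the path from the example above. The check id and the checklist wording are hypothetical; the point is only that each checklist item should have a check behind it.

```yaml
deliverables:
  - path: /etc/systemd/system/myapp.service
    description: |
      A systemd unit file for myapp:
        - Exists at the path above
        - Is not empty

validation:
  strategy: all
  checks:
    # Hypothetical check id; it mirrors the deliverable's checklist.
    # test -s succeeds only if the file exists and is non-empty.
    - id: unit-file-written
      name: Unit File Written
      type: ssh
      cmd: test -s /etc/systemd/system/myapp.service
      failure_hint: Write the unit file to /etc/systemd/system/myapp.service
```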

Briefing & Evidence

The briefing: block gives incident and challenge labs their operational context. It replaces the description text with a structured scenario that reads like a real alert. The evidence: block (top-level, separate from briefing) adds flavour cards: PagerDuty alerts, Slack messages, emails, and tickets or notes.

briefing fields

severity* (enum)

sev-1 (service down, production impact) · sev-2 (degraded) · sev-3 (non-critical) · sev-4 (no immediate impact / training task)

impact* (string)

One-line blast radius, e.g. Web service returning 500 on all endpoints.

title* (string)

Short alert title, e.g. Permission Chaos Detected.

narrative* (string)

3–5 sentences of operational context. What broke, when, what is affected, what the student needs to do.

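evidence items

Each item in the evidence: list takes a type, a title, and a content body. The type selects how the card renders.

Evidence types

type: pagerduty

Renders as a PagerDuty alert card.

type: slack

Renders as a Slack message card.

type: email

Renders as an email message.

type: note

Renders as a plain ticket or note.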

briefing:
  severity: sev-2
  impact: Web service returning 500 on all endpoints
  title: Permission Chaos Detected
  narrative: |
    A junior engineer ran sudo chown -R root:root /var/www/html.
    nginx runs as www-data and can no longer read any files.
    Every request returns 500. Fix permissions — no restarts until you
    understand what broke.

evidence:
  - type: pagerduty
    title: "P2: nginx 500 on all requests"
    content: |
      CRITICAL: prod-web-01 — all requests returning 500
      nginx error log: Permission denied reading /var/www/html
  - type: slack
    title: "#ops-alerts"
    content: |
      [automated] nginx 500 rate: 100% — prod-web-01
      First seen: 14:23 UTC
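
The example above covers only the pagerduty and slack types. A sketch of the other two follows, on the assumption that email and note items take the same type, title, and content fields. The message text is invented for illustration.

```yaml
# Assumes email and note items use the same fields as pagerduty and slack.
evidence:
  - type: email
    title: "Customer report: site is down"
    content: |
      From: support@example.com
      Every page on the site has returned an error since about 14:20 UTC.
  - type: note
    title: "Ticket: ownership change on prod-web-01"
    content: |
      A recursive chown was run on /var/www/html earlier today.
      Nobody has verified which user nginx runs as.
```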

Validation Checks

Checks run against the live EC2 instance when the student clicks Submit. With strategy: all (the default), every check in the array must pass for the lab to be marked complete; with strategy: any, one passing check is enough. Transient failures are retried up to 3 times automatically.

Validation Block Structure

validation:
  strategy: all        # all checks must pass (default). use 'any' for alternatives.
  default_timeout_seconds: 5
  checks:
    - id: nginx-running
      name: Nginx Running
      type: output_matches
      cmd: systemctl is-active nginx
      contains: active
      failure_hint: Run 'sudo systemctl restart nginx' and check 'journalctl -xe'
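
For comparison, a sketch of a block that uses strategy: any, where either of two acceptable outcomes completes the lab. The second path (config.yml) and both check ids are hypothetical.

```yaml
validation:
  strategy: any              # one passing check is enough
  default_timeout_seconds: 5
  checks:
    - id: config-yaml
      name: Config File (.yaml)
      type: ssh
      cmd: test -f /etc/myapp/config.yaml
      failure_hint: Create the config file at /etc/myapp/config.yaml
    - id: config-yml
      name: Config File (.yml)
      type: ssh
      cmd: test -f /etc/myapp/config.yml
      failure_hint: Create the config file at /etc/myapp/config.yaml
```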

Check Fields

id* (string)

Unique identifier within the lab (kebab-case). Shown in validation results.

name* (string)

Human-readable display name shown while the check runs.

type* (enum)

Check type: output_matches · ssh · http_get · multi_check

cmd (string)

Shell command to run via SSH (output_matches, ssh).

url (string)

HTTP URL to request (http_get). Supports {{TEMPLATE_VAR}} substitution.

contains (string)

Output must contain this substring (output_matches).

pattern (string)

Output must match this regex (output_matches). Takes priority over contains.

value (string)

Output must equal this exactly, trimmed (output_matches). Lowest priority.

expect_status (integer)

Expected HTTP status code (http_get). Default: any 2xx.

operator (enum)

multi_check logic: all (default) or any.

checks (Check[])

Sub-checks for multi_check type.

failure_hint (string)

Shown to the student when this check fails. Write it as a command to run, not a description of what is wrong.

output_matches

Runs cmd via SSH. Passes if exit code is 0 and the output satisfies the match criteria. Covers most validation needs.

- id: nginx-running
  name: Nginx Running
  type: output_matches
  cmd: systemctl is-active nginx
  contains: active
  failure_hint: Run 'sudo systemctl restart nginx' and check 'journalctl -xe'

- id: disk-below-90
  name: Disk Below 90%
  type: output_matches
  # Block scalar: the awk ternary contains ": ", which a plain YAML scalar cannot hold.
  cmd: |-
    df / | awk 'NR==2 {gsub("%",""); print ($5 < 90) ? "ok" : "full"}'
  value: ok
  failure_hint: Run 'du -sh /* 2>/dev/null | sort -h' to find what is using space
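
The examples above use contains and value. The sketch below adds a pattern check and a stricter exact-match variant of the nginx check. The check ids are hypothetical, and the regex is kept to a simple start anchor because the regex dialect the platform uses is not stated on this page.

```yaml
# Exact match: value compares the trimmed output, so "inactive" cannot pass.
- id: nginx-active-exact
  name: Nginx Active (exact)
  type: output_matches
  cmd: systemctl is-active nginx
  value: active
  failure_hint: Run 'sudo systemctl status nginx' to see why the service is not active

# Regex match: stat prints the mode and owner, e.g. "755 www-data:www-data".
- id: webroot-ownership
  name: Web Root Owned by www-data
  type: output_matches
  cmd: stat -c '%a %U:%G' /var/www/html
  pattern: '^755 www-data:www-data'
  failure_hint: Run 'ls -ld /var/www/html' to inspect the owner and mode
```

Because systemctl is-active exits non-zero for anything other than active, the contains form is already guarded by the exit-code rule; value simply makes the intent explicit.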

ssh

Runs cmd via SSH. Passes if exit code is 0. Output is not checked — use for existence checks where you only care whether the command succeeds.

- id: config-exists
  name: Config File Exists
  type: ssh
  cmd: test -f /etc/myapp/config.yaml
  failure_hint: Create the config file at /etc/myapp/config.yaml

- id: service-unit-present
  name: Unit File Written
  type: ssh
  cmd: test -f /etc/systemd/system/myapp.service
  failure_hint: Write the unit file to /etc/systemd/system/myapp.service

http_get

Makes an HTTP GET to url. Passes if the response is 2xx, or matches expect_status if set.

- id: app-responding
  name: App HTTP Responding
  type: http_get
  url: "http://{{EC2_IP}}:8080/health"
  failure_hint: Check the app with 'systemctl status myapp' and 'journalctl -u myapp -n 50'

- id: auth-enforced
  name: Auth Required on Admin Route
  type: http_get
  url: "http://{{EC2_IP}}:8080/admin"
  expect_status: 401

multi_check

Runs a list of sub-checks. With operator: all every sub-check must pass. With operator: any at least one must pass.

- id: stack-healthy
  name: Full Stack Healthy
  type: multi_check
  operator: all
  checks:
    - id: nginx-active
      name: Nginx Active
      type: output_matches
      cmd: systemctl is-active nginx
      contains: active
    - id: app-active
      name: App Active
      type: output_matches
      cmd: systemctl is-active myapp
      contains: active
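
A sketch of the operator: any form, for cases where more than one outcome is acceptable. It assumes http_get sub-checks are allowed inside a multi_check and that failure_hint may sit on the parent; the port 80 route and the check ids are hypothetical.

```yaml
- id: app-reachable
  name: App Reachable
  type: multi_check
  operator: any            # pass if either route answers with a 2xx
  checks:
    - id: app-via-proxy
      name: App via Port 80
      type: http_get
      url: "http://{{EC2_IP}}/health"
    - id: app-direct
      name: App on Port 8080
      type: http_get
      url: "http://{{EC2_IP}}:8080/health"
  failure_hint: Check the app with 'systemctl status myapp' and 'journalctl -u myapp -n 50'
```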

Writing Good Checks

✓ Test the outcome, not the method. systemctl is-active nginx is better than checking if a config file was written.

✓ Reduce output to something predictable: pipe through awk to produce ok / fail, or use grep -c to produce a count.

✓ Write failure_hint as a command to run, not a description of what is wrong.

✗ Do not use ps aux | grep myapp, because the grep process matches its own command line. Use systemctl is-active or pgrep -f.

✗ Do not hardcode IPs. Use {{EC2_IP}} in http_get URLs.
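
Two of these guidelines can be sketched as checks. The ids are hypothetical, and port 8080 and the myapp unit are borrowed from the other examples on this page.

```yaml
# Reduce output to a count: grep -c prints the number of matching lines
# and exits non-zero when that number is 0, so the check fails either way.
- id: app-listening
  name: App Listening on 8080
  type: output_matches
  cmd: ss -ltn | grep -c ':8080 '
  pattern: '^[1-9]'
  failure_hint: Run 'ss -ltnp' to see which ports have listeners

# Ask systemd for the service state instead of grepping the process list.
- id: myapp-active
  name: myapp Service Active
  type: ssh
  cmd: systemctl is-active --quiet myapp
  failure_hint: Run 'systemctl status myapp' and 'journalctl -u myapp -n 50'
```

The count is matched with a pattern rather than an exact value because a service bound to both IPv4 and IPv6 produces two listener lines.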

Hint System

Every lab requires three hint levels. The levels are progressive — each reveals more than the last. Good hints are the difference between a student who learns and one who rage-quits.

Level 1

Directional

Points the student at the right area without giving commands. What should they be looking at? What questions should they be asking? Shown automatically on lab load when hint_policy is show_level_1_automatically.

Level 2

Specific

Names the relevant commands and flags. Still expects the student to piece together the fix from the hints.

Level 3

Full walkthrough

Every command, in order, with a one-line explanation of why each step is necessary. A student who is completely stuck should be able to follow this to completion.

hint_policy: show_level_1_automatically
hints:
  - level: 1
    text: |
      Something swept ownership across a directory it should not have touched.
      Find out which directory is affected and what nginx needs to read it.

  - level: 2
    text: |
      Check what nginx is trying to read:
        sudo journalctl -u nginx --since "5 min ago"
        ls -la /var/www/html

      nginx runs as www-data. Fix ownership:
        sudo chown -R www-data:www-data /var/www/html
        sudo chmod 755 /var/www/html

  - level: 3
    text: |
      # Confirm the problem
      sudo journalctl -u nginx --since "5 min ago" | grep "Permission denied"
      ls -la /var/www/html   # should show root:root — that is wrong

      # Fix ownership
      sudo chown -R www-data:www-data /var/www/html
      sudo chmod 755 /var/www/html
      sudo find /var/www/html -type f -exec chmod 644 {} \;

      # Verify nginx can now read the files
      sudo nginx -t
      curl -s -o /dev/null -w "%{http_code}" http://localhost

Terraform Modules

The terraform_module field tells the platform which infrastructure template to provision for your lab. Choose the module that matches your scenario — you do not write or modify Terraform.

labs/linux-ec2 (stable)

Single Ubuntu 22.04 EC2 instance (t3.micro). Connects via Cloudflare Tunnel. Used for all Linux track labs. The scenario_id determines what is broken on the instance.

Template variables: EC2_IP, INSTANCE_ID

labs/docker-ec2 (planned)

EC2 instance with Docker and Docker Compose pre-installed. Broken container configurations are injected via userdata. For docker track labs.

Template variables: EC2_IP, INSTANCE_ID

labs/kubernetes-eks (planned)

EKS cluster with a pre-configured broken workload. For kubernetes track labs. Note: EKS provisioning takes 8–12 minutes.

Template variables: CLUSTER_ENDPOINT, CLUSTER_NAME

Full Example

A complete incident-mode lab — all required fields, a briefing, evidence cards, deliverables, and validation checks:

schema_version: 1

metadata:
  version: 1.0.0
  author: your-name
  reviewed_by: []
  last_updated: 2026-06-14

id: linux-disk-full
title: "Disk Full: Production Filesystem Saturated"
description: >
  The root filesystem is at 100%. Every write is failing.
  Find what consumed the disk, clear space, and restore the service.

track: linux
module: linux-storage
sort_order: 80
difficulty: intermediate
engineer_level: L2
mode: incident

tags:
  - linux
  - disk
  - storage
  - troubleshooting
  - incident

estimated_minutes: 45
timeout_minutes: 90
terraform_module: labs/linux-ec2
scenario_id: disk-full

environment:
  instance_type: t3.micro
  estimated_cost: "$0.01"
  aws_services:
    - ec2

prerequisites:
  - linux-filesystem-anatomy

briefing:
  severity: sev-1
  impact: Service down — production impact
  title: Filesystem 100% — write errors
  narrative: |
    The production web server is returning 500 errors on every request.
    The app cannot write logs or temp files. The root filesystem hit 100%.
    Identify what consumed the disk, clear space, and restore the service.

evidence:
  - type: pagerduty
    title: "P1: Filesystem 100% — write errors"
    content: |
      CRITICAL: / at 100% on prod-web-01
      App is throwing ENOSPC on every write
      Services failing: nginx, app, cron
  - type: slack
    title: "#ops-alerts"
    content: |
      [automated] ALERT: prod-web-01 disk usage 100%
      First seen: 14:23 UTC — still firing

objectives:
  - Identify what is consuming disk space
  - Remove the bloat without deleting production data
  - Restore the service and confirm it is responding

success_criteria:
  - df / shows less than 90% used
  - nginx is active and responding
  - App is responding on port 8080

deliverables:
  - path: /tmp/disk-cleanup.log
    description: |
      A log of what you found and what you removed:
        - Output of du -sh showing the bloat
        - Commands run to free space
        - df -h after cleanup

hint_policy: show_level_1_automatically
hints:
  - level: 1
    text: |
      Something wrote a lot of data somewhere.
      Find out what is large. Find out what is growing.

  - level: 2
    text: |
      df -h — which filesystem is full?
      sudo du -sh /* 2>/dev/null | sort -h — what is largest?
      sudo journalctl --disk-usage — are logs the culprit?
      ls -lhS /tmp /var/log /var/cache — spot large files

  - level: 3
    text: |
      df -h
      sudo du -sh /* 2>/dev/null | sort -h
      sudo journalctl --vacuum-size=200M
      sudo find /var/log -name "*.gz" -delete
      sudo apt-get clean
      df -h
      sudo systemctl restart nginx

validation:
  strategy: all
  default_timeout_seconds: 5
  checks:
    - id: disk-below-90
      name: Disk Below 90%
      type: output_matches
      # Block scalar: the awk ternary contains ": ", which a plain YAML scalar cannot hold.
      cmd: |-
        df / | awk 'NR==2 {gsub("%",""); print ($5 < 90) ? "ok" : "full"}'
      value: ok
      failure_hint: Run 'du -sh /* 2>/dev/null | sort -h' to find what is using space

    - id: nginx-running
      name: Nginx Running
      type: output_matches
      cmd: systemctl is-active nginx
      contains: active
      failure_hint: Start nginx with 'sudo systemctl start nginx'

    - id: app-responding
      name: App HTTP Responding
      type: http_get
      url: "http://{{EC2_IP}}:8080/health"
      failure_hint: Check app logs with 'journalctl -u myapp -n 50'

completion_message: |
  A full disk silently kills services that try to write.
  You can now isolate the cause, clean safely, and verify the fix.
  Set up disk monitoring so you see this before it hits 100%.

Submit a Lab

Good labs come from real incidents. If you have been through an outage and want to turn it into a lab, here is what to prepare:

Write the YAML

Follow the schema on this page. Use an existing lab as a starting point. Declare schema_version: 1 at the top.

Choose the right module

Pick the terraform_module that matches your scenario (labs/linux-ec2 for Linux track labs). Pick or request a scenario_id for what gets broken on the instance.

Write all three hints

Level 1 directional, Level 2 specific commands, Level 3 full walkthrough. A lab without a complete Level 3 will be sent back.

Write tight checks

Every validation check must test the outcome, not the method. Verify your commands produce consistent output before submitting.

Send your YAML

Submit the file to the Devleep team for review. The team will test the scenario, wire up the infrastructure, and publish the lab.

Note: You do not need access to the platform codebase, Terraform, or the database. Write the YAML and send it — the team handles the rest.