Ops 4: Container Orchestration

The VP of Engineering attended KubeCon and came back a changed person. A company-wide Slack message confirmed that “all services will migrate to Kubernetes by Q3.” When you pointed out that the Minecraft server is not a business-critical production service, you were told that “all means all.” There was a follow-up message clarifying that this includes the Minecraft server specifically.

Your Docker Compose deployment from Ops 3 works, but it still assumes one host, one runtime, and an application-layer recovery story that lives mostly outside the orchestrator. You are not being asked to rebuild the service from scratch. You are being asked to migrate that existing Minecraft deployment onto Kubernetes in a controlled way: preserve the artifact chain, preserve the world data, and add the declarative rollout and recovery controls leadership now expects.

Learning Objectives

Deploy and operate a stateful workload in Kubernetes using k3s.
Apply Kubernetes primitives correctly: Services, ConfigMaps/Secrets, health probes, and resource controls.
Preserve and protect world state across pod restarts and node reboots.
Demonstrate rollout, rollback, and failure recovery as operational procedures.

Constraints (AWS Academy)

You must use AWS Academy resources only.
Start from your Ops 3 baseline: reuse or adapt your Terraform/OpenTofu project, pinned ECR image source, S3 backup location, and IAM instance profile pattern.
Kubernetes must run on EC2; use k3s unless your instructor explicitly approves an alternative.
Infrastructure must be provisioned via Terraform/OpenTofu.
Use an IAM instance profile for AWS access from the EC2 host, including S3 backup access and any ECR authentication path you configure. Do not hardcode AWS credentials in manifests, Kubernetes Secrets, or environment variables.
Minimize public exposure: restrict SSH access to a known source, and open only ports required for Minecraft.
Document cost controls: instance size, stop schedule, and at least one additional guardrail.
State handling must be explicit and defensible; on single-node k3s, a simple approach is acceptable if it is well justified.

Requirements

A. Provisioning

Terraform/OpenTofu provisions an EC2 host that runs k3s.
Start from your Ops 3 infrastructure code. You may refactor Docker/Compose-specific host setup into k3s bootstrap, but the deployment must remain rebuildable from code.
You may reuse Ansible or cloud-init for node bootstrap and backup/restore plumbing, but workload configuration must live in Kubernetes manifests or Helm values rather than ad hoc shell commands.
Security Group rules are minimal and justified: SSH restricted to a known source; TCP 25565 open for Minecraft; no unnecessary ports.

B. Kubernetes Deployment

Minecraft runs in Kubernetes using a workload controller appropriate for a single-replica stateful service. A Deployment with a PVC is acceptable on single-node k3s if you justify the tradeoff; a StatefulSet is also acceptable. Use the image you published in Assignment 2 or built via your CI/CD pipeline from Assignment 3. Reference a specific pinned tag, not latest.
Required configuration is delivered via ConfigMap/Secret as appropriate.
Expose Minecraft through a Kubernetes Service on TCP 25565. For the standard single-node k3s path, use a Service of type LoadBalancer so k3s ServiceLB binds 25565 on the host. Do not use NodePort or hostPort as the primary submission path.
Liveness and readiness probes are defined and justified in your documentation.
Probe choice must be defensible for a long-starting Java service. A startupProbe is recommended; if you omit it, explain how your timings avoid restart loops during startup.
Resource requests and limits are set and justified in your documentation.
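
The probe requirements above come down to simple arithmetic: how long does the kubelet wait before it gives up on a container that is still starting? The sketch below is one way to reason about that, not a required configuration. It approximates the startup budget as initialDelaySeconds + periodSeconds × failureThreshold, using the Kubernetes defaults (0, 10, and 3) when a field is omitted. The specific timing values are illustrative assumptions and should be replaced with numbers measured from your own server's startup.

```python
import json


def startup_budget_seconds(probe):
    """Approximate time before a continuously failing probe triggers a restart.

    Missing fields fall back to the Kubernetes defaults:
    initialDelaySeconds=0, periodSeconds=10, failureThreshold=3.
    """
    initial = probe.get("initialDelaySeconds", 0)
    period = probe.get("periodSeconds", 10)
    failures = probe.get("failureThreshold", 3)
    return initial + period * failures


# A TCP check on the Minecraft port; this only proves the port is open.
TCP_CHECK = {"tcpSocket": {"port": 25565}}

# Illustrative timings (assumptions, not measured values).
probes = {
    "startupProbe": dict(TCP_CHECK, periodSeconds=10, failureThreshold=30),
    "readinessProbe": dict(TCP_CHECK, periodSeconds=10, failureThreshold=3),
    "livenessProbe": dict(TCP_CHECK, periodSeconds=20, failureThreshold=3),
}

for name, probe in probes.items():
    print(name, startup_budget_seconds(probe), "seconds")

# The dictionary has the same shape as the probe fields of a container spec.
print(json.dumps(probes, indent=2))
```

With these example values the startupProbe allows roughly 300 seconds for the server to open its port, while a liveness probe with default settings and no startupProbe would allow only about 30 seconds. That gap is the restart-loop risk the documentation is asked to address.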

C. Persistence and Safety

World data is stored on a persistent volume that survives pod deletion.
Your documentation must justify the persistence approach relative to single-node k3s tradeoffs.
World data is backed up to S3; the backup trigger or schedule is documented.
The restore procedure is step-by-step and verifiable by another operator.
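
As one possible shape for the backup step, the sketch below packs a world directory into a timestamped archive and builds the upload command without running it. It assumes the AWS CLI is installed on the host and picks up credentials from the instance profile. The bucket name, key prefix, and helper names are hypothetical. It also does not flush or pause world saves first, which a consistent backup of a running server would normally need.

```python
import tarfile
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def archive_world(world_dir: Path, out_dir: Path,
                  now: Optional[datetime] = None) -> Path:
    """Pack world_dir into a UTC-timestamped .tar.gz and return its path."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / ("world-%s.tar.gz" % stamp)
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths relative so a restore unpacks into ./world
        tar.add(world_dir, arcname=world_dir.name)
    return archive


def upload_command(archive: Path, bucket: str,
                   prefix: str = "backups") -> List[str]:
    """Build (but do not run) the AWS CLI upload command."""
    target = "s3://%s/%s/%s" % (bucket, prefix, archive.name)
    return ["aws", "s3", "cp", str(archive), target]


# Demo against a throwaway directory so the sketch runs anywhere.
scratch = Path(tempfile.mkdtemp())
demo_world = scratch / "world"
demo_world.mkdir()
(demo_world / "level.dat").write_bytes(b"placeholder")

fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
archive = archive_world(demo_world, scratch / "out", now=fixed)
print(archive.name)  # world-20240102T030405Z.tar.gz
print(" ".join(upload_command(archive, "example-backup-bucket")))
```

A timestamped key makes the restore procedure easier to verify: the runbook can name the exact object to download, and another operator can confirm that the unpacked archive contains the expected world directory.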

D. Operational Demonstrations

You must demonstrate all of the following:

A rollout to a new image version.
A rollback to the previous version.
One failure drill (choose one):
- Node reboot: reboot the EC2 host; brief downtime during the reboot is acceptable on single-node k3s, but k3s and Minecraft must recover automatically with world data intact.
- Bad deploy: push a deployment with a broken or missing image tag; show rollout failure detection and the rollback process restoring service.
- Resource exhaustion: simulate a resource constraint (e.g., set extremely low memory limits); show OOM or restart detection and recovery.
During the drill, show at least one authoritative Kubernetes diagnostic view appropriate to the failure, such as kubectl describe, kubectl get events, kubectl rollout status/history, or kubectl logs.
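
For the resource exhaustion drill, the restart evidence lives in the pod status that kubectl get pod <minecraft-pod> -o json returns. The sketch below shows one way to summarize it and is a supplement to the diagnostic views named above, not a replacement for them. The sample document is hand-written for illustration and is not real cluster output; the pod and container names in it are hypothetical.

```python
import json


def restart_findings(pod):
    """Summarize restart count and last termination reason per container."""
    findings = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        # lastState.terminated is absent until a container has exited once.
        last = cs.get("lastState", {}).get("terminated") or {}
        findings.append({
            "container": cs.get("name"),
            "restarts": cs.get("restartCount", 0),
            "last_reason": last.get("reason"),
            "last_exit_code": last.get("exitCode"),
        })
    return findings


# Hand-written sample (an assumption for illustration, not captured output).
SAMPLE_POD = {
    "metadata": {"name": "minecraft-example"},
    "status": {
        "containerStatuses": [
            {
                "name": "minecraft",
                "restartCount": 2,
                "state": {"running": {}},
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            }
        ]
    },
}

findings = restart_findings(SAMPLE_POD)
print(json.dumps(findings, indent=2))
```

To use it against a live pod, the same function could be fed json.load(sys.stdin) with the kubectl output piped in. A last_reason of OOMKilled alongside a rising restart count is the signal to show before raising the memory limit and confirming recovery.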

E. Documentation

Your PDF must be usable by another operator with no prior knowledge of your setup. Required sections:

Architecture diagram showing EC2, k3s, all Kubernetes resources (Deployment or StatefulSet, Service, PVC, ConfigMap/Secret), ECR, and S3.
Runbook covering: deployment procedure, service exposure on 25565, rollout/rollback steps, backup procedure, and step-by-step restore from S3.
Tradeoff notes justifying your workload controller choice, persistence approach, service exposure choice, probe configuration, and resource limits.
Link to your version control repository and a concise file map identifying the exact submission files. This must cover the Terraform/OpenTofu code, Kubernetes manifests or Helm values, and any supporting automation you used (for example: Ansible, cloud-init, scripts). You may also include selected code blocks in the PDF, but the grader must be able to locate the exact submitted configuration quickly.
Teardown checklist to prevent runaway cost after the assignment ends.

Hints

k3s is a lightweight Kubernetes distribution that runs well on a single node, but stateful workloads like Minecraft still require careful configuration.

You can and should reuse your Ops 3 Terraform, IAM instance profile pattern, S3 backup location, and image pipeline. The new work here is replacing Docker Compose with Kubernetes resources, not inventing a second artifact chain.
k3s uses containerd as its container runtime. Your ECR images push and pull normally, but k3s still needs an authentication path for private ECR pulls. On EC2, prefer a node-level approach that uses the attached IAM instance profile to obtain short-lived ECR credentials at pull time. If you instead use a Kubernetes imagePullSecret, treat it as a fallback and document how the temporary ECR token is created and refreshed.
Minecraft does not expose a standard HTTP health endpoint. A TCP socket probe (tcpSocket: { port: 25565 }) is an acceptable baseline when your chosen server image exposes no stronger health signal, but document the limitation: an open port is weaker evidence than true application readiness.
Minecraft startup can be slow, especially when loading a world. A startupProbe, or conservative liveness/readiness timing, can prevent Kubernetes from restarting the container while the JVM is still starting and the world is still loading.
The default k3s storage class (local-path) provisions volumes on the node’s local disk. This is sufficient for this assignment; document the tradeoff relative to a cloud-managed persistent volume.
For public exposure in this assignment, prefer a LoadBalancer Service on port 25565. On single-node k3s, ServiceLB (klipper-lb) binds the service port directly on the host when that port is available.
Ingress is primarily for HTTP and HTTPS routing. Minecraft does not need it for the primary exposure path in this assignment.
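
To make the LoadBalancer hint concrete, the sketch below builds a minimal Service object and prints it as JSON, which kubectl apply -f accepts alongside YAML. The Service name and the app label are hypothetical and must match whatever labels your own pod template uses. It is a starting point under those assumptions, not a complete or required manifest.

```python
import json


def minecraft_service(name="minecraft", app_label="minecraft", port=25565):
    """Return a Service of type LoadBalancer for the Minecraft port."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # On single-node k3s, ServiceLB handles this Service type.
            "type": "LoadBalancer",
            # Must match the labels on the Minecraft pod template.
            "selector": {"app": app_label},
            "ports": [
                {
                    "name": "minecraft",
                    "protocol": "TCP",
                    "port": port,
                    "targetPort": port,
                }
            ],
        },
    }


service = minecraft_service()
print(json.dumps(service, indent=2))
```

If the output is saved to a file in the repository, it can be applied like any other manifest, and kubectl get svc should then show the Service with an external address on port 25565 once ServiceLB has bound the host port.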

You may use any Minecraft server software you like. The key is that your documentation is clear and reproducible for another operator.

What You’ll Submit

Architecture and Operations Documentation (PDF): covers your Kubernetes architecture, runbook, tradeoff decisions, repository link and file map for the submitted automation/configuration, and teardown checklist. Another operator must be able to deploy, operate, roll back, and restore the server from this document without asking you questions.
Narrated screen recording (max 3 minutes). Your server MOTD must include your name or student ID. Submit timestamps alongside the video (e.g., “Checkpoint 1: 0:00, Checkpoint 2: 0:38, …”):
1. kubectl get nodes and kubectl get pods showing the k3s node and Minecraft pod running, then nmap -sV -Pn -p T:25565 <public-endpoint> showing 25565/tcp open with your custom MOTD.
2. kubectl delete pod <minecraft-pod> followed by kubectl get pods showing the replacement pod come up, then evidence that the replacement pod mounts the same PVC and that the world directory still contains your data. If you need a concrete verification pattern, adapt the persistence-check approach from Ops 2, section B.
3. Roll out a new image version (e.g., kubectl set image or a manifest update), confirm it deploys, then roll back to the previous version and confirm the server is joinable after rollback.
4. Execute your chosen failure drill: introduce the failure, show authoritative diagnostic output (kubectl describe, events, rollout status/history, logs, or equivalent), execute recovery, and confirm the server is joinable with world data intact.

Rubric

Always refer to Canvas for the most up-to-date rubric information. Canvas's rubric will be used for grading.

Container Orchestration (Total: 100 pts)
Criteria and ratings

Video: Kubernetes running and reachable (10)
Video checkpoint 1: kubectl get nodes and kubectl get pods show the k3s node and Minecraft pod running; nmap -sV -Pn -p T:25565 <public-endpoint> shows 25565/tcp open with Minecraft responding; MOTD contains name or student ID.
- 10 pts, Complete: All three elements clearly shown: kubectl output showing k3s node and Minecraft pod running, nmap output showing 25565/tcp open with Minecraft responding, and MOTD containing name or student ID.
- 5 pts, Partial: Two of three elements clearly shown, or one is ambiguous (e.g., pod status unclear, MOTD missing name/ID, or nmap output incomplete).
- 0 pts, Missing: No credible evidence of Minecraft running on k3s and reachable on 25565.

Video: Persistence after pod deletion (10)
Video checkpoint 2: kubectl delete pod shown removing the Minecraft pod; kubectl get pods shows the replacement pod coming up; directory listing or equivalent confirms world data is present on the persistent volume after recovery.
- 10 pts, Complete: All three elements clearly shown: pod deletion, replacement pod coming up, and world data confirmed present on the persistent volume.
- 5 pts, Partial: Two of three elements clearly shown, or one is ambiguous (e.g., pod deletion shown but data verification is missing, or the replacement pod recovery is not confirmed).
- 0 pts, Missing: No credible persistence demonstration.

Video: Rollout and rollback (10)
Video checkpoint 3: a new image version is deployed and confirmed running; a rollback to the previous version is executed and confirmed; the server is joinable (nmap or equivalent) after rollback.
- 10 pts, Complete: All three elements clearly shown: new version deployed and confirmed, rollback executed, and server confirmed joinable after rollback.
- 5 pts, Partial: Two of three elements clearly shown, or one is ambiguous (e.g., version change not clearly evidenced, rollback not shown, or post-rollback service not confirmed).
- 0 pts, Missing: No credible rollout or rollback demonstrated.

Video: Failure drill (10)
Video checkpoint 4: failure is clearly introduced; authoritative diagnostic output is shown (for example, kubectl describe, events, rollout status/history, or logs); recovery is executed; server is confirmed joinable with world data intact after recovery.
- 10 pts, Complete: All four elements clearly shown: failure introduced, authoritative diagnostic output shown, recovery executed, and server confirmed joinable with world data intact.
- 5 pts, Partial: Three of four elements clearly shown, or one is ambiguous (e.g., failure introduction unclear, diagnostic output not shown, or world data not confirmed after recovery).
- 0 pts, Missing: No credible failure drill demonstrated.

Workload: k3s manifests and image configuration (15)
Evaluated on four elements: (1) workload controller choice is appropriate for a single-replica stateful service and references a pinned ECR image tag, not latest, (2) ConfigMap/Secret delivers required server configuration, (3) a LoadBalancer Service exposes Minecraft on 25565 using k3s ServiceLB, (4) Terraform/OpenTofu provisions the EC2 host and k3s setup.
- 15 pts, All four elements: All four elements present and correctly implemented.
- 11 pts, Three elements: Three of four elements present; one minor gap.
- 7 pts, Two elements: Two of four elements present; significant missing configuration in manifests or provisioning.
- 0 pts, One or zero elements: Fewer than two elements present or manifests do not produce a running service.

Persistence: storage and backup/restore (15)
Evaluated on three elements: (1) world data stored on a persistent volume that survives pod deletion, (2) S3 backup procedure implemented and documented with trigger or schedule, (3) step-by-step restore procedure documented and verifiable by another operator.
- 15 pts, All three elements: All three elements present and correctly implemented.
- 10 pts, Two elements: Two of three elements present; one gap (e.g., backup implemented but restore steps are missing, or persistence mechanism is present but undocumented).
- 5 pts, One element: Only one element clearly present; persistence or backup/restore strategy is significantly incomplete.
- 0 pts, Missing: World data does not persist, or no backup/restore process is documented or implemented.

Operations: probes, resource controls, and security (15)
Evaluated on four elements: (1) liveness and readiness probes defined and justified in documentation, (2) startup behavior is handled defensibly for a long-starting Java service, either with a startupProbe or clearly justified timing choices, (3) resource requests and limits set with documented justification, (4) Security Group rules are minimal (SSH restricted, only 25565 open publicly) and the IAM instance profile handles S3 access without hardcoded credentials.
- 15 pts, All four elements: All four elements present and correctly implemented.
- 11 pts, Three elements: Three of four elements present; one gap (e.g., startup behavior not justified, resource limits weakly documented, or security group has unnecessary open ports).
- 7 pts, Two elements: Two of four elements present; significant gaps in operational controls or security posture.
- 0 pts, One or zero elements: Fewer than two elements present or operational controls are not meaningfully justified.

Documentation: architecture, runbook, and teardown (15)
All four required sections present and usable by another operator: (1) architecture diagram showing EC2, k3s, Kubernetes resources (Deployment or StatefulSet, Service, PVC, ConfigMap/Secret), ECR, and S3, (2) runbook covering deployment, exposure on 25565, rollout/rollback, and backup/restore procedures, (3) tradeoff notes justifying workload controller choice, persistence approach, service exposure choice, probe configuration, and resource limits, (4) teardown checklist to prevent runaway cost.
- 15 pts, Exemplary: All four sections present, accurate, and usable by another operator without asking clarifying questions.
- 12 pts, Proficient: Three of four sections complete or one has a minor gap; overall usable by another operator.
- 8 pts, Developing: Two of four sections complete or multiple sections are too vague to act on; teardown checklist missing.
- 0 pts, Insufficient: Fewer than two sections present or documentation is not usable by another operator.

Extra Credit (up to +10)

Helm Chart (+5): Package the Minecraft deployment as a Helm chart with configurable values (server properties, resource limits, image tag). Show that helm install and helm upgrade work correctly.
Network Policy (+5): Define a Kubernetes NetworkPolicy that restricts pod-level traffic for the Minecraft workload to only the required ports and peers. Document what it blocks, what it does not block, and why. Do not treat it as a replacement for Security Group or Service exposure controls.

Extra credit must stay within this assignment’s Kubernetes scope (no additional cloud services or multi-node cluster setups).