Docker + ECR on EC2

The server crashed during the CEO’s first session. He lost a diamond pickaxe. An all-hands email with the subject line “ACCOUNTABILITY” was sent at 11:47 PM. The phrase “enterprise-grade reliability” now appears in your quarterly objectives.

Your manual setup works, but it drifts every time someone touches it, upgrades break in unpredictable ways, and recovery is a prayer-based workflow. Obsidian Dynamics now requires containerized, versioned, and recoverable operations that another operator can execute without improvisation.

Learning Objectives

Package a stateful service in Docker while preserving data correctly.
Publish and consume immutable image versions from ECR.
Execute safe upgrade and rollback procedures with explicit checks.
Implement S3 backup/restore with a defined retention policy.

Constraints (AWS Academy)

Compute remains on EC2.
Container images are stored in ECR.
Backups are stored in a private S3 bucket.
Security Group exposure must stay minimal and justified.

Requirements

Runtime Definition (Container + Service)

Provide a container runtime definition that runs the Minecraft server on EC2.
- docker compose is allowed but not required for a single container.
- Plain docker run, Podman, or equivalent is acceptable if reproducible.
Service must be reachable by clients on 25565/tcp.
- Verify reachability using:
```
nmap -sV -Pn -p T:25565 <instance_public_ip>
```
Runtime configuration is externalized (environment variables and/or mounted config).
Service must come back automatically after host reboot.
- Acceptable mechanisms include container restart policies and/or generated systemd units (docker/podman workflows).
- You do not need to hand-write a systemd unit if your runtime tooling generates one.
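
As a rough illustration of the requirements above, a plain docker run wrapper might look like the sketch below. The image URI, container name, host data path, and the `/data` mount point are placeholders, and the `EULA` and `MOTD` environment variables are assumptions that follow the convention of some community Minecraft images; substitute whatever your chosen image actually expects.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder, not part of the assignment.
IMAGE="${IMAGE:-123456789012.dkr.ecr.us-east-1.amazonaws.com/minecraft:mc-1.21.1-build7}"
NAME="${NAME:-minecraft}"
DATA_DIR="${DATA_DIR:-/srv/minecraft/data}"
MOTD="${MOTD:-Obsidian Dynamics - YOUR_STUDENT_ID}"

start_server() {
  # --restart unless-stopped: Docker brings the container back after a host
  #   reboot, provided the Docker daemon itself is enabled at boot.
  # -p 25565:25565: publish the Minecraft port on the host.
  # -v: bind-mount world data so it lives outside the container.
  # -e: runtime configuration is passed in, not baked into the image
  #   (EULA and MOTD are assumed variable names).
  docker run -d \
    --name "$NAME" \
    --restart unless-stopped \
    -p 25565:25565 \
    -v "$DATA_DIR:/data" \
    -e EULA=TRUE \
    -e MOTD="$MOTD" \
    "$IMAGE"
}
```

Calling `start_server` once creates the container. The restart policy only helps if the Docker service starts at boot, so it is worth confirming that on your host before recording the reboot checkpoint.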

State Boundary and Persistence

World data is stored outside the container image (host volume or bind mount).
Demonstrate persistence by restarting/recreating the container while preserving world state.
- Verification approach:
  1. Create a marker file in the world directory: `touch world/PERSISTENCE_TEST`
  2. Note the modification time of level.dat: `stat -c '%Y' world/level.dat` (or `stat -f '%m' world/level.dat` on macOS)
  3. Stop, remove, and recreate the container
  4. Verify marker file still exists and level.dat timestamp is unchanged
Clearly document what is stateful vs immutable.
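
The verification steps above could be scripted roughly as follows. This is a sketch assuming a Linux host with GNU stat; `WORLD_DIR` and `STATE_FILE` are hypothetical paths chosen for illustration.

```shell
#!/bin/sh
# Sketch: WORLD_DIR and STATE_FILE are placeholder locations.
WORLD_DIR="${WORLD_DIR:-/srv/minecraft/data/world}"
STATE_FILE="${STATE_FILE:-/tmp/persistence_check.mtime}"

# Steps 1-2: drop a marker and record level.dat's modification time.
record_state() {
  touch "$WORLD_DIR/PERSISTENCE_TEST"
  stat -c '%Y' "$WORLD_DIR/level.dat" > "$STATE_FILE"
}

# Step 4: confirm the marker survived and the timestamp matches.
verify_state() {
  if [ ! -f "$WORLD_DIR/PERSISTENCE_TEST" ]; then
    echo "FAIL: marker file missing"
    return 1
  fi
  before="$(cat "$STATE_FILE")"
  after="$(stat -c '%Y' "$WORLD_DIR/level.dat")"
  if [ "$before" != "$after" ]; then
    echo "FAIL: level.dat mtime changed ($before -> $after)"
    return 1
  fi
  echo "OK: marker present, level.dat mtime unchanged ($after)"
}
```

Run `record_state`, then stop, remove, and recreate the container (step 3), then run `verify_state`. A running server may rewrite level.dat when it saves, so if the timestamp differs, check when each reading was taken before concluding that data was lost.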

ECR Publishing and Version Discipline

Publish your image to an ECR repository.
Baseline approach: Re-tag a trusted upstream Minecraft server image and publish it to your ECR repository.
- Document upstream image provenance and why it is trusted.
Define and use an immutable tagging scheme (for example: mc-1.21.1-build7).
A `latest` tag may exist, but deployments must pin a specific version tag.
You must have at least two distinct versioned images published to ECR to demonstrate upgrade and rollback.
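
One possible shape for the re-tag-and-publish flow is sketched below. The account ID, region, and repository name are placeholders, and the tag pattern is only one way to enforce a scheme like mc-1.21.1-build7; the helper names `is_versioned_tag` and `publish` are hypothetical.

```shell
#!/bin/sh
# Sketch: account ID, region, and repository name are placeholders.
REGION="${REGION:-us-east-1}"
REGISTRY="${REGISTRY:-123456789012.dkr.ecr.$REGION.amazonaws.com}"
REPO="${REPO:-minecraft}"

# Accept only tags shaped like mc-1.21.1-build7; reject "latest" and similar.
is_versioned_tag() {
  printf '%s\n' "$1" | grep -Eq '^mc-[0-9]+\.[0-9]+(\.[0-9]+)?-build[0-9]+$'
}

# Re-tag an upstream image and push it to ECR under a pinned version tag.
publish() {
  upstream="$1"
  tag="$2"
  if ! is_versioned_tag "$tag"; then
    echo "refusing to publish non-versioned tag: $tag" >&2
    return 1
  fi
  # Authenticate Docker against the ECR registry.
  aws ecr get-login-password --region "$REGION" |
    docker login --username AWS --password-stdin "$REGISTRY"
  docker pull "$upstream"
  docker tag "$upstream" "$REGISTRY/$REPO:$tag"
  docker push "$REGISTRY/$REPO:$tag"
}
```

Usage would be `publish <upstream-image> mc-1.21.1-build7`, repeated with a second tag to satisfy the two-image requirement. The script only checks the tag's shape; to have ECR itself reject overwrites, tag immutability can be enabled on the repository (for example with `aws ecr put-image-tag-mutability --image-tag-mutability IMMUTABLE`).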

Safe Upgrade and Rollback Workflow

Provide a pre-change checklist that includes:
- Backup world data to S3 before any upgrade (mandatory).
- Verify both image versions exist in ECR.
Perform an upgrade from one pinned image version to another and validate success.
Perform rollback to the prior known-good image version without rebuilding from scratch.
Include post-change validation steps and expected healthy outcomes.
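
A minimal sketch of the upgrade and rollback mechanics follows. It assumes the mandatory S3 backup has already been taken, that the AWS CLI has a default region configured, and that the container is started with the same placeholder options throughout; the function names, paths, and the `EULA` variable are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: registry, repository, container name, and data path are placeholders.
REGISTRY="${REGISTRY:-123456789012.dkr.ecr.us-east-1.amazonaws.com}"
REPO="${REPO:-minecraft}"
NAME="${NAME:-minecraft}"
DATA_DIR="${DATA_DIR:-/srv/minecraft/data}"

# Pre-change check: every tag we might deploy must already exist in ECR.
tags_exist() {
  for tag in "$@"; do
    aws ecr describe-images --repository-name "$REPO" \
      --image-ids imageTag="$tag" >/dev/null || {
      echo "missing in ECR: $tag" >&2
      return 1
    }
  done
}

# Replace the running container with one pinned to the given tag.
# World data stays in DATA_DIR, so the same function serves upgrade and rollback.
deploy() {
  image="$REGISTRY/$REPO:$1"
  docker pull "$image" || return 1
  docker stop "$NAME" && docker rm "$NAME"
  docker run -d --name "$NAME" --restart unless-stopped \
    -p 25565:25565 -v "$DATA_DIR:/data" -e EULA=TRUE "$image"
}

# Post-change check: container is running and uses the expected image.
verify() {
  running="$(docker inspect -f '{{.State.Running}}' "$NAME")"
  image="$(docker inspect -f '{{.Config.Image}}' "$NAME")"
  [ "$running" = "true" ] && [ "$image" = "$REGISTRY/$REPO:$1" ]
}
```

An upgrade would then read `tags_exist OLD NEW && deploy NEW && verify NEW`, and a rollback is simply `deploy OLD && verify OLD` with no rebuild. A world that has been opened by a newer server version may not load cleanly under the older one, which is one reason the backup step comes first.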

Backup/Restore to S3

Create a backup artifact for world data and upload it to S3.
S3 bucket must be private.
Configure at least one S3 lifecycle rule (e.g., expire backups older than 7 days) and document why you chose that retention period.
Demonstrate restore from S3 backup onto the service.
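
The backup, restore, and retention pieces could be sketched as below. The bucket name, the `backups/` prefix, and the local paths are placeholders, and the helper names are hypothetical; the 7-day default mirrors the example in the requirement, not a recommendation.

```shell
#!/bin/sh
# Sketch: bucket name, key prefix, and paths are placeholders.
BUCKET="${BUCKET:-obsidian-mc-backups-example}"
DATA_DIR="${DATA_DIR:-/srv/minecraft/data}"

# Archive the world directory and upload it under a timestamped key.
# Prints the key so it can be recorded in the runbook.
backup() {
  key="backups/world-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
  tar -czf /tmp/world-backup.tar.gz -C "$DATA_DIR" world || return 1
  aws s3 cp /tmp/world-backup.tar.gz "s3://$BUCKET/$key" --only-show-errors || return 1
  echo "$key"
}

# Download a named backup and unpack it over the (stopped) server's data dir.
restore() {
  aws s3 cp "s3://$BUCKET/$1" /tmp/world-restore.tar.gz --only-show-errors || return 1
  rm -rf "$DATA_DIR/world"
  tar -xzf /tmp/world-restore.tar.gz -C "$DATA_DIR"
}

# Emit a lifecycle configuration that expires backups after N days (default 7).
lifecycle_json() {
  printf '{"Rules":[{"ID":"expire-old-backups","Status":"Enabled","Filter":{"Prefix":"backups/"},"Expiration":{"Days":%s}}]}\n' "${1:-7}"
}
```

To apply the retention rule, something like `lifecycle_json 7 > lifecycle.json` followed by `aws s3api put-bucket-lifecycle-configuration --bucket "$BUCKET" --lifecycle-configuration file://lifecycle.json` should work. Stopping the container before `backup` and `restore` is the safer choice, since archiving files while the server is writing them may produce an inconsistent copy.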

Operator Documentation

Your documentation must include:

Build + publish workflow (conceptual steps, not command dump only).
ECR usage expectations and image tag policy.
Runtime architecture diagram (simple is acceptable).
Backup and restore runbook for S3.
Upgrade and rollback runbook with explicit checks.

What You’ll Submit

Operator Runbook (PDF) containing all required documentation sections and answers to the reflection questions above.
Container deployment definition (docker-compose.yml or equivalent).
ECR reference + versioning policy (repository URI and tag strategy), plus image provenance notes.
Narrated screen recording (max 3 minutes) with a timestamp list for each checkpoint:
1. Show running container, nmap reachability on 25565/tcp, and reboot auto-recovery.
2. Persistence proof: place a marker, restart/recreate the container, show marker survives.
3. Upgrade to a new ECR image version, then rollback to the previous version.
4. S3 backup upload and restore demonstration.

Your server MOTD must include your name or student ID. Submit timestamps alongside the video.

Rubric

Always refer to Canvas for the most up-to-date rubric information. Canvas's rubric will be used for grading.

Docker + ECR on EC2 (Total: 100 pts)
Criteria and ratings:

Container runtime correctness (20)
Scored on runtime definition and service operation: reproducible container run method (Compose optional), service starts on EC2, auto-recovers after EC2 reboot, and clients can reach `25565/tcp`. Video evidence required for reboot auto-recovery.
- 20 pts, Exemplary: Deployment definition is complete and reproducible; service starts cleanly, returns automatically after host reboot (shown in video), and reachability evidence is clear and correct.
- 16 pts, Proficient: Runtime and reachability mostly work; reboot auto-start is functional but video evidence has minor gaps.
- 10 pts, Developing: Partial runtime setup works, but major gaps remain in reboot behavior, deployment reproducibility, or reachability proof.
- 0 pts, Insufficient: Service is not reproducibly runnable in container form, does not auto-return after reboot, or reachability is not demonstrated.

Persistence boundary design (20)
Scored on state handling: world data is externalized from the image, boundaries are explicit, and persistence is proven across container restart/recreate. Video evidence required showing world survives container lifecycle.
- 20 pts, Exemplary: State vs image boundary is explicit and correct; video evidence clearly shows world continuity across container restart/recreate.
- 16 pts, Proficient: Persistence approach is mostly correct; mechanism works but documentation or video proof has minor ambiguity.
- 10 pts, Developing: Some persistence mechanism exists, but boundary confusion or weak validation leaves reliability uncertain.
- 0 pts, Insufficient: World state is effectively tied to container lifecycle or persistence proof is missing.

ECR publishing and version discipline (20)
Scored on artifact workflow: image is published to ECR, at least two distinct versioned images exist, tags are immutable and meaningful, and deployments pin explicit versions (not latest-only). Image provenance documented.
- 20 pts, Exemplary: ECR workflow is reproducible; at least two versioned images exist; version scheme is consistent and immutable; pinned deployments support deterministic rollback; provenance documented.
- 16 pts, Proficient: ECR use and versioning are mostly correct; two images exist but minor gaps in consistency, pinning practice, or provenance notes.
- 10 pts, Developing: ECR publication occurs but fewer than two images, version policy is weak, or pinning/provenance is incomplete.
- 0 pts, Insufficient: No credible ECR publish workflow, no version discipline for controlled deployment, or fewer than two images for rollback.

Upgrade and rollback execution (15)
Scored on change process execution: pre-change checklist includes mandatory S3 backup, successful upgrade validation shown in video, and rollback to known-good version demonstrated without rebuild.
- 15 pts, Exemplary: Pre-change checklist is explicit and includes S3 backup; video shows upgrade execution and validation; rollback is executed and restores known-good behavior.
- 12 pts, Proficient: Change workflow is mostly complete; upgrade and rollback work but one area has weaker evidence (pre-check detail, validation, or rollback proof).
- 8 pts, Developing: Partial change workflow exists, but upgrade/rollback is incomplete, pre-change checklist missing S3 backup, or validation insufficient.
- 0 pts, Insufficient: No operationally credible upgrade+rollback procedure is demonstrated, or mandatory S3 backup step is missing from checklist.

S3 backup, restore, and retention (15)
Scored on data protection: backup upload to private S3, defined retention/lifecycle policy, and verified restore path shown in video.
- 15 pts, Exemplary: Backups are uploaded to private S3, lifecycle policy is defined, and restore is successfully demonstrated in video.
- 12 pts, Proficient: Backup and restore workflow is mostly correct; one requirement is weak or partially evidenced (privacy, retention, or restore validation).
- 8 pts, Developing: Backup-related steps exist but privacy, retention, or restore validation is incomplete or not demonstrated.
- 0 pts, Insufficient: No credible private S3 backup/restore process is demonstrated.

Operator documentation quality (10)
Scored on runbook usability: contains all required sections (build/publish workflow, ECR usage, architecture diagram, S3 runbook, upgrade/rollback runbook), and executable procedures a new TA/operator can follow without guesswork.
- 10 pts, Exemplary: Runbook is clear, ordered, and operator-ready; all required sections are complete, actionable, and conceptual rather than command-dump.
- 8 pts, Proficient: Runbook is usable with minor ambiguity or one small missing detail; most required sections are complete.
- 5 pts, Developing: Runbook is partially usable but requires significant inference in multiple sections or missing required content.
- 0 pts, Insufficient: Documentation is incomplete, unclear, or not executable by another operator; multiple required sections missing.

Extra Credit (up to +10)

Custom Dockerfile (+5): Build your own container image using a Dockerfile instead of re-tagging an upstream image. Include a justified base image choice and at least one hardening step (e.g., non-root user, minimal layers). Document your build choices.
In-service administration (+5): Configure secure remote administration (e.g., RCON with authentication) and execute at least 3 admin commands without stopping the container. Include safety notes about when admin intervention is appropriate vs. when a controlled restart is safer.

Extra credit must stay within this assignment’s scope (no orchestration, IaC frameworks, or CI/CD pipelines).