Merged
59 commits
e5ad25c
docs: [#405] create deployment journal directory structure for Hetzne…
josecelano Mar 3, 2026
739f003
docs: [#405] document and complete prerequisites for Hetzner demo tra…
josecelano Mar 3, 2026
61e24fe
docs: [#405] configure environment and create for Hetzner demo tracke…
josecelano Mar 3, 2026
2034809
docs: [#405] refactor hetzner-demo-tracker docs into per-command subd…
josecelano Mar 3, 2026
3ffb1d4
docs: [#405] add debug-command-failure skill for investigating deploy…
josecelano Mar 3, 2026
010a053
docs: [#405] refine Problem 5 root cause with precise log-based evidence
josecelano Mar 3, 2026
f4c5e8f
docs: [#405] add provision improvements document with deployer enhanc…
josecelano Mar 3, 2026
019e39c
fix: [#405] add IdentitiesOnly=yes to default SSH options
josecelano Mar 3, 2026
3248a63
fix: [#405] add IdentitiesOnly=yes to Ansible ssh_args
josecelano Mar 3, 2026
182e33b
feat: [#405] log full SSH stderr in wait_for_connectivity retry messages
josecelano Mar 3, 2026
a28016b
feat: [#405] increase SSH retry budget to 5 minutes with 5s interval
josecelano Mar 3, 2026
642d043
docs: [#405] add cleanup-between-attempts guide and update provision …
josecelano Mar 3, 2026
9ba435f
docs: [#405] document provision success, passphrase bug, IPv6 omissio…
josecelano Mar 3, 2026
fda04f1
docs: [#405] add attempt-4 screenshot and document Hetzner activity l…
josecelano Mar 3, 2026
3872406
docs: [#405] mark provision task as complete in issue tracker
josecelano Mar 3, 2026
db6a702
docs: [#405] add post-provision guides (DNS + volume setup) and assig…
josecelano Mar 4, 2026
837057d
docs: [#405] configure floating IPs permanently on VM via netplan
josecelano Mar 4, 2026
4a4914b
docs: document DNS record creation via Hetzner Cloud API
josecelano Mar 4, 2026
4fba745
docs: document volume setup via Hetzner Cloud API
josecelano Mar 4, 2026
1422f83
docs: fix misleading volume snapshot claim; document limitation
josecelano Mar 4, 2026
650936f
docs: document configure command execution (task 3.2 done)
josecelano Mar 4, 2026
31d724c
docs: document volume/IP setup sequencing tradeoffs in post-provision
josecelano Mar 4, 2026
beaf2f9
docs: add observations file and fill in missing ToC entries
josecelano Mar 4, 2026
e874e96
docs: document release fails when deployer runs inside Docker (docker…
josecelano Mar 4, 2026
fb7a7ff
fix: skip docker-compose local validation when docker is not in PATH
josecelano Mar 4, 2026
a5c7913
docs: document successful release command (task 3.3 done)
josecelano Mar 4, 2026
b136e5c
docs(hetzner-demo): document run command MySQL bugs and failure
josecelano Mar 4, 2026
d3d6c64
docs(hetzner-demo): clarify Bug 2 root password was never implemented
josecelano Mar 4, 2026
16167cc
docs(hetzner-demo): add Bug 3 (URL encoding) and run improvements doc
josecelano Mar 4, 2026
45f06bf
docs(hetzner-demo): populate run README and add test command docs
josecelano Mar 4, 2026
fb288b3
docs: add test command output, verify guides, floating IP improvement
josecelano Mar 4, 2026
96224e0
docs(verify): add API, HTTP tracker, and health check verification re…
josecelano Mar 4, 2026
a2d454c
docs(verify): fix corrupted results table in api.md
josecelano Mar 4, 2026
c0dd03a
docs(verify): add Grafana verification results
josecelano Mar 4, 2026
7c3d68b
docs(verify): add UDP tracker verification results
josecelano Mar 4, 2026
c5ab1db
docs(verify): add Docker services health and log verification
josecelano Mar 4, 2026
8f97bf8
docs(verify): add MySQL database connectivity verification
josecelano Mar 4, 2026
31c4869
docs(verify): add storage volume mount verification
josecelano Mar 4, 2026
6870512
docs(verify): add actual tree output to storage verification
josecelano Mar 4, 2026
487c1b5
docs(verify): add backup verification and document credentials oversight
josecelano Mar 4, 2026
7a4f714
docs(verify): add Torrust tracker client announce tests for HTTP and …
josecelano Mar 4, 2026
7001123
docs(deploy): update progress — all 9 services verified, fill in serv…
josecelano Mar 4, 2026
36af759
docs(post-provision): add screenshot of Hetzner backups enabled state
josecelano Mar 4, 2026
ea7ea22
docs(post-provision): add Hetzner backups step to ToC and post-provis…
josecelano Mar 4, 2026
0894d85
docs(maintenance): add secrets rotation guide for post-AI-agent deplo…
josecelano Mar 4, 2026
96d1628
docs(maintenance): mark Hetzner Cloud and DNS API tokens as deleted
josecelano Mar 4, 2026
8a0064d
docs(maintenance): mark Grafana admin password as rotated (step 3 done)
josecelano Mar 4, 2026
a3775f1
docs(maintenance): add OS updates guide with apply/verify/reboot proc…
josecelano Mar 4, 2026
c684920
docs(maintenance): fix restart vs recreate for env var changes in step 1
josecelano Mar 4, 2026
c55763c
docs(maintenance): mark tracker admin token rotation as done (step 1)
josecelano Mar 4, 2026
d3d1401
docs(maintenance): mark MySQL torrust and root password rotation as d…
josecelano Mar 4, 2026
fda93d0
docs(maintenance): mark SSH deployer key rotation as done (step 4)
josecelano Mar 4, 2026
954a6a8
docs(maintenance): mark local file archival done (step 7) and secrets…
josecelano Mar 4, 2026
4b55ad0
docs(maintenance): add step 4g - delete old SSH key from Hetzner cons…
josecelano Mar 4, 2026
9b11d27
docs: record successful reboot, tick API/SSH checklist, add public RE…
josecelano Mar 4, 2026
4c79724
docs(maintenance): add uptime monitoring guide
josecelano Mar 4, 2026
81a31b9
docs: add tracker registry guide for newTrackon submission
josecelano Mar 4, 2026
11c020e
docs: add bugs index for hetzner demo tracker deployment
josecelano Mar 4, 2026
675b826
docs: add improvements index for hetzner demo tracker deployment
josecelano Mar 4, 2026
231 changes: 231 additions & 0 deletions .github/skills/usage/operations/debug-command-failure/skill.md
@@ -0,0 +1,231 @@
---
name: debug-command-failure
description: Guide for debugging and investigating deployer command failures. Covers reading error output, locating trace files, inspecting environment state, examining build artifacts, and running manual verification steps. Use when any deployer command (provision, configure, release, run, etc.) fails. Triggers on "command failed", "debug failure", "investigate error", "why did it fail", "trace", "deployer error", or "command error".
metadata:
author: torrust
version: "1.0"
---

# Debugging Deployer Command Failures

This skill walks through collecting and interpreting diagnostic information when any deployer
command fails.

## Investigation Layers (in order)

```text
1. Console error output → immediate symptom + tip
2. Environment state → data/{env}/environment.json
3. Trace log → data/{env}/traces/{timestamp}-{command}.log
4. Build artifacts → build/{env}/
5. Manual verification → SSH, curl, provider console
```

Work top-to-bottom. Each layer provides richer context than the previous.

---

## Layer 1 — Console Error Output

A failed command prints:

```text
❌ <command> command failed: <error summary>
Tip: <actionable hint>
Tip: Check logs and try running with --log-output file-and-stderr for more details
```

Note the **error summary** and the **tip** lines. The summary often names the failed step and the
kind of error.

---

## Layer 2 — Environment State

After any command failure, the deployer writes machine-readable state:

```text
data/{env-name}/environment.json
```

Key fields to inspect:

```json
{
  "state": {
    "context": {
      "failed_step": "WaitSshConnectivity",
      "error_kind": "NetworkConnectivity",
      "error_summary": "SSH connectivity failed: ...",
      "failed_at": "2026-03-03T15:33:32Z",
      "execution_started_at": "2026-03-03T15:30:00Z",
      "execution_duration": { "secs": 212, "nanos": 885591647 },
      "trace_id": "bcba0ee9-b2cf-4302-be0e-5ed04c665141",
      "trace_file_path": "./data/{env-name}/traces/20260303-153332-provision.log"
    }
  }
}
```

| Field | What it tells you |
| -------------------- | ---------------------------------------------------------- |
| `failed_step` | Which internal step failed (maps to deployer source code) |
| `error_kind` | Category: `NetworkConnectivity`, `TemplateRendering`, etc. |
| `error_summary` | Human-readable description of the error |
| `execution_duration` | How long the command ran before failing |
| `trace_file_path` | Exact path to the full trace log |

```bash
# Quick inspection
python3 -m json.tool data/{env-name}/environment.json
# or
jq '.state.context' data/{env-name}/environment.json
```
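
The `execution_duration` object stores whole seconds and leftover nanoseconds separately, which is awkward to read at a glance. The sketch below shows the conversion to minutes and seconds. `format_duration` is a hypothetical helper name used only for illustration; it is not part of the deployer.

```bash
# format_duration SECS NANOS
# Hypothetical helper: turns the two execution_duration fields into "XmYY.ZZZs".
format_duration() {
  local secs="$1" nanos="$2"
  # Integer arithmetic only: milliseconds are truncated, not rounded.
  printf '%dm%02d.%03ds\n' "$((secs / 60))" "$((secs % 60))" "$((nanos / 1000000))"
}
```

With the sample values shown earlier, `format_duration 212 885591647` prints `3m32.885s`, so that command ran for roughly three and a half minutes before failing.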

---

## Layer 3 — Trace Log

The trace log records every step, sub-step, and decision the deployer made:

```text
data/{env-name}/traces/{YYYYMMDD-HHMMSS}-{command}.log
```

The exact path is in `environment.json → state.context.trace_file_path`.
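
To avoid copying the path by hand, a small helper can read a single string field from `state.context`. This is a sketch under assumptions: `context_field` is a hypothetical name, the JSON layout is taken from the sample in Layer 2, and the `sed` fallback only works on pretty-printed JSON with one field per line.

```bash
# context_field FILE FIELD
# Hypothetical helper: prints one string field from state.context.
# Uses jq when it is installed; otherwise falls back to a line-oriented sed
# match that assumes pretty-printed JSON with one field per line.
context_field() {
  local file="$1" field="$2"
  if command -v jq >/dev/null 2>&1; then
    jq -r --arg f "$field" '.state.context[$f] // empty' "$file"
  else
    sed -n "s/^[[:space:]]*\"$field\":[[:space:]]*\"\(.*\)\",\{0,1\}[[:space:]]*\$/\1/p" "$file" | head -n 1
  fi
}
```

For example, `context_field data/{env-name}/environment.json trace_file_path` would print the trace log path, which can then be passed to `tail` or `grep`. A missing field yields empty output in both branches.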

```bash
# Read the full log
cat data/{env-name}/traces/20260303-153332-provision.log

# Focus on errors and warnings
grep -E 'ERROR|WARN|failed|error' data/{env-name}/traces/20260303-153332-provision.log

# Show the last 50 lines (where failures are usually recorded)
tail -50 data/{env-name}/traces/20260303-153332-provision.log
```

The trace contains structured log lines with timestamps, log levels, and context fields. Look for
`ERROR` lines and the step names that precede them.
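
One way to see the step names that precede a failure is to print the first `ERROR` line together with a few lines of leading context. The helper below is a minimal sketch; `first_error_context` is a hypothetical name, and it assumes only that failing lines contain the literal text `ERROR`.

```bash
# first_error_context TRACE_FILE [LINES_BEFORE]
# Hypothetical helper: prints the first line containing ERROR together with
# the lines just before it (default 10), where the failing step is usually named.
# Returns non-zero when the file contains no ERROR line.
first_error_context() {
  local file="$1" before="${2:-10}"
  grep -m 1 -B "$before" 'ERROR' "$file"
}
```

Stopping at the first match keeps the output short when one failure triggers a cascade of follow-up errors.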

---

## Layer 4 — Build Artifacts

The `build/` directory holds rendered templates and intermediate files generated before
infrastructure is touched:

```text
build/{env-name}/
├── tofu/
│   └── hetzner/  (or lxd/)
│       ├── main.tf          # OpenTofu infrastructure definition
│       ├── cloud-init.yml   # cloud-init script run on first boot
│       └── *.tf             # Other Terraform/OpenTofu files
└── ansible/
    ├── inventory.ini        # Ansible inventory
    └── playbooks/           # Ansible playbooks

Common inspections:

```bash
# Verify SSH public key was correctly injected into cloud-init
grep -A3 'ssh_authorized_keys' build/{env-name}/tofu/hetzner/cloud-init.yml

# Compare with the actual public key
cat ~/.ssh/torrust_tracker_deployer_ed25519.pub

# Inspect the infrastructure definition
cat build/{env-name}/tofu/hetzner/main.tf
```

**Why this matters**: Build artifacts are generated from your config file without touching the
cloud provider. If the artifact is wrong, the root cause is in the environment config or a
template bug — not in the network or provider.
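
The key comparison above can be automated instead of checked by eye. The sketch below assumes the rendered `cloud-init.yml` contains the public key as plain text; `key_in_cloud_init` is a hypothetical helper name. It compares only the key type and base64 body, because the trailing comment may differ between copies.

```bash
# key_in_cloud_init PUBKEY_FILE CLOUD_INIT_FILE
# Hypothetical helper: succeeds when the key type and base64 body from the
# public key file appear verbatim (as a substring) in the rendered cloud-init file.
key_in_cloud_init() {
  local pubkey_file="$1" cloud_init_file="$2" key
  # Keep "<type> <base64>" from the first well-formed line; drop the comment.
  key=$(awk 'NF >= 2 { print $1 " " $2; exit }' "$pubkey_file")
  [ -n "$key" ] && grep -qF -- "$key" "$cloud_init_file"
}
```

A non-zero exit status here points at the environment config or the template rather than at the network or the provider.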

---

## Layer 5 — Manual Verification

When the deployer fails but the cloud resource appears to be up, verify the resource directly.

### SSH connectivity

```bash
# Test SSH manually with verbose output (-v shows handshake details)
ssh -v -i ~/.ssh/torrust_tracker_deployer_ed25519 torrust@{server-ip} "whoami && cloud-init status"
```

A successful response looks like:

```text
torrust
status: done
```

If `cloud-init status` returns `status: running`, cloud-init is still executing — wait and retry.
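
The wait-and-retry step can be scripted with a generic polling loop. This is a sketch, not the deployer's own retry logic; `retry_until` is a hypothetical helper name.

```bash
# retry_until MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, sleeping INTERVAL_SECONDS
# between attempts. Returns 1 once MAX_ATTEMPTS attempts have all failed.
retry_until() {
  local max_attempts="$1" interval="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
}
```

For instance, `retry_until 60 5 nc -z {server-ip} 22` would poll the SSH port every 5 seconds for up to 60 attempts, which is comparable to the 5-minute, 5-second-interval budget named in commit a28016b.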

### Cloud-init timing

```bash
# Check cloud-init completion and timing
ssh -i ~/.ssh/torrust_tracker_deployer_ed25519 torrust@{server-ip} \
  "cloud-init status --long && sudo journalctl -u ssh --since '5 minutes ago' | tail -20"
```

**Note**: If the clock timestamp shows `1970-01-01`, the system clock was not yet NTP-synced when
cloud-init completed — this is normal and does not indicate a failure.

### Port availability

```bash
# Check if SSH port is open (-w 5 gives up after 5 seconds if the port is filtered)
nc -zv -w 5 {server-ip} 22

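# Check if the UDP tracker port is reachable (6969 is a UDP tracker port in the demo
# deployment; -u probes over UDP, which cannot prove that the service answers)
nc -zuv {server-ip} 6969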
```

---

## Common Error Patterns

| `failed_step` | `error_kind` | Likely Cause |
| ------------------------- | --------------------- | -------------------------------------------------------------------- |
| `RenderOpenTofuTemplates` | `TemplateRendering` | SSH key path not found — check container vs host path in config |
| `WaitSshConnectivity` | `NetworkConnectivity` | Server SSH not ready within timeout — server may need more boot time |
| `RunAnsiblePlaybook` | `Ansible` | SSH key rejected or unreachable — verify `~/.ssh/known_hosts` |
| `CreateServer` | `ProviderApi` | API token invalid or quota exceeded — check Hetzner console |

---

## After Investigation

Once the root cause is identified, the recovery path depends on how far the command progressed:

- **Failed before any cloud resources were created** (e.g., `TemplateRendering`): fix the config,
  run `purge --force`, then `create environment`, and retry the command.

- **Failed after cloud resources were created** (e.g., `WaitSshConnectivity`): the deployer state
  is `ProvisionFailed` or `ConfigureFailed`, and resources exist in the cloud. Run `destroy` to
  clean up both cloud resources and local state, then `create environment` and retry.

```bash
# Destroy cloud resources + local state
docker run --rm \
  -v "$(pwd)/data:/var/lib/torrust/deployer/data" \
  -v "$(pwd)/build:/var/lib/torrust/deployer/build" \
  -v "$(pwd)/envs:/var/lib/torrust/deployer/envs" \
  -v ~/.ssh:/home/deployer/.ssh:ro \
  torrust/tracker-deployer:latest \
  destroy {env-name}

# Recreate local environment
docker run --rm \
  -v "$(pwd)/data:/var/lib/torrust/deployer/data" \
  -v "$(pwd)/build:/var/lib/torrust/deployer/build" \
  -v "$(pwd)/envs:/var/lib/torrust/deployer/envs" \
  torrust/tracker-deployer:latest \
  create environment --env-file envs/{env-name}.json
```
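
Because both commands repeat the same volume mounts, a small wrapper function can cut down on typing mistakes during recovery. This is a sketch: `deployer` and the `DRY_RUN` variable are hypothetical names, and mounting `~/.ssh` read-only for every subcommand is an assumption (the `create environment` example above does not mount it).

```bash
# deployer SUBCOMMAND [ARGS...]
# Hypothetical wrapper around the docker run invocation used in this guide.
# Set DRY_RUN=1 to print the command instead of running it.
deployer() {
  # Prepend the fixed docker arguments to whatever the caller passed.
  set -- docker run --rm \
    -v "$PWD/data:/var/lib/torrust/deployer/data" \
    -v "$PWD/build:/var/lib/torrust/deployer/build" \
    -v "$PWD/envs:/var/lib/torrust/deployer/envs" \
    -v "$HOME/.ssh:/home/deployer/.ssh:ro" \
    torrust/tracker-deployer:latest "$@"
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

With this in place, the recovery sequence becomes `deployer destroy {env-name}` followed by `deployer create environment --env-file envs/{env-name}.json`. Running with `DRY_RUN=1` first is a cheap way to confirm the mounts before touching cloud resources.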
1 change: 1 addition & 0 deletions AGENTS.md
@@ -179,6 +179,7 @@ Available skills:
| Creating issues | `.github/skills/dev/planning/create-issue/skill.md` |
| Creating new skills | `.github/skills/add-new-skill/skill.md` |
| Creating refactor plans | `.github/skills/dev/planning/create-refactor-plan/skill.md` |
| Debugging command failures | `.github/skills/usage/operations/debug-command-failure/skill.md` |
| Debugging test errors | `.github/skills/dev/testing/debug-test-errors/skill.md` |
| Handling errors in code | `.github/skills/dev/rust-code-quality/handle-errors-in-code/skill.md` |
| Handling secrets | `.github/skills/dev/rust-code-quality/handle-secrets/skill.md` |
125 changes: 125 additions & 0 deletions docs/deployments/hetzner-demo-tracker/README.md
@@ -0,0 +1,125 @@
# Deployment Journal: Hetzner Demo Tracker

**Issue**: [#405](https://github.com/torrust/torrust-tracker-deployer/issues/405)
**Date started**: 2026-03-03
**Domain**: `torrust-tracker-demo.com`
**Provider**: Hetzner Cloud

## Purpose

Deploy a public Torrust Tracker demo instance to Hetzner Cloud and document every step of the process. This journal will serve as the source material for a blog post on [torrust.com](https://torrust.com).

## Table of Contents

1. [Prerequisites](prerequisites.md) — Account setup, tools, SSH keys
2. [Deployment Specification](deployment-spec.md) — What we want to deploy: config decisions,
endpoints, sanitized config
3. Deployment commands — step-by-step per deployer command:
- [create](commands/create/README.md) — generate template, validate, create environment
- [provision](commands/provision/README.md) — create the Hetzner VM
- [configure](commands/configure/README.md) — install Docker and Docker Compose on the server
- [release](commands/release/README.md) — pull and stage Docker images
- [run](commands/run/README.md) — start all services
4. Post-provision manual steps (done once, before `configure`):
- [DNS setup](post-provision/dns-setup.md) — assign floating IPs, create DNS records, verify
- [Volume setup](post-provision/volume-setup.md) — create and mount Hetzner volume for storage
- [Hetzner Backups](post-provision/hetzner-backups.md) — enable automated server backups (can be done any time after provisioning)
5. [Service Verification](verify/README.md) — verifying all services after deployment:
- [HTTP Tracker](verify/http-tracker.md)
- [UDP Tracker](verify/udp-tracker.md)
- [Tracker API](verify/api.md)
- [Grafana](verify/grafana.md)
- [Health Check](verify/health-check.md)
- [Docker Services](verify/docker-services.md)
- [MySQL Database](verify/mysql.md)
- [Storage Volume](verify/storage.md)
- [Backup](verify/backup.md)
6. Problems — issues encountered, per command:
- [create problems](commands/create/problems.md)
- [provision problems](commands/provision/problems.md)
7. Improvements — recommended deployer improvements found during this deployment:
- [provision improvements](commands/provision/improvements.md)
8. [Observations](observations.md) — cross-cutting insights and learnings about the deployer
9. [Maintenance](maintenance/README.md) — post-deployment operational tasks:
- [Secrets rotation](maintenance/secrets-rotation.md) — rotate all secrets after AI-assisted deployment
10. [Tracker Registry](tracker-registry.md) — submit the tracker to public registries (newTrackon)
11. [Bugs](bugs.md) — all deployer bugs discovered during this deployment (11 bugs, 1 fixed)
12. [Improvements](improvements.md) — all improvement recommendations collected in one place (13 items)

## Deployment

> This section will be filled in as we execute each deployment phase.

### Phase 1: Setup and Prerequisites

See [prerequisites.md](prerequisites.md) for the complete checklist.

### Phase 2: Create and Configure Environment

See [deployment-spec.md](deployment-spec.md) for config decisions and the sanitized config.
See [commands/create/README.md](commands/create/README.md) for running the `create template`, `validate`, and
`create environment` commands.

### Phase 3: Provision Infrastructure

See [commands/provision/README.md](commands/provision/README.md) for running the `provision` command and server
details.

### Phase 3.5: Post-Provision Setup

Manual steps done once after provisioning, required before `configure`:

1. [DNS setup](post-provision/dns-setup.md) — assign floating IPs to the server and create DNS
records for all six domains.
2. [Volume setup](post-provision/volume-setup.md) — create a 50 GB Hetzner volume and mount it
at `/opt/torrust/storage` so persistent data lives on a separate disk.
3. [Hetzner Backups](post-provision/hetzner-backups.md) — enable automated daily server backups
via the Hetzner Console (can be done at any time after provisioning).

See [post-provision/README.md](post-provision/README.md) for the full overview.

### Phase 4: Configure Instance

See [commands/configure/README.md](commands/configure/README.md) for running the `configure`
command. It installed Docker 28.2.2 and Docker Compose v2.29.2.

### Phase 5: Release Application

See [commands/release/README.md](commands/release/README.md) for running the `release`
command. It pulled and staged all Docker images (~134 s, state=`Released`).

### Phase 6: Run Services

See [commands/run/README.md](commands/run/README.md) for running the `run`
command. All services started successfully (state=`Running`).

### Phase 7: Verify Deployment

See [verify/README.md](verify/README.md) for the full verification index.
All 9 services verified — HTTP tracker, UDP tracker, Tracker API, Grafana,
health check, Docker services, MySQL database, storage volume, and backup.
Verification included end-to-end announce tests using the Torrust reference
client (`http_tracker_client` and `udp_tracker_client`).

## Service Endpoints

> Will be filled after deployment.

| Service | URL | Status |
| -------------- | ------------------------------------------------- | ---------- |
| HTTP Tracker 1 | `https://http1.torrust-tracker-demo.com/announce` | ✅ Running |
| HTTP Tracker 2 | `https://http2.torrust-tracker-demo.com/announce` | ✅ Running |
| UDP Tracker 1 | `udp://udp1.torrust-tracker-demo.com:6969` | ✅ Running |
| UDP Tracker 2 | `udp://udp2.torrust-tracker-demo.com:6868` | ✅ Running |
| Tracker API | `https://api.torrust-tracker-demo.com/api/v1` | ✅ Running |
| Health Check | `http://127.0.0.1:1313/health_check` (internal) | ✅ Running |
| Grafana | `https://grafana.torrust-tracker-demo.com` | ✅ Running |
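
For quick port probes, the URLs in the table can be split into a host and a port. The sketch below uses only shell parameter expansion; `endpoint_host_port` is a hypothetical helper name, and falling back to 443 for `https` and 80 for `http` is an assumption based on the usual scheme defaults.

```bash
# endpoint_host_port URL
# Hypothetical helper: prints "HOST PORT" for a URL from the endpoints table.
# Falls back to 443 (https) or 80 (http) when the URL has no explicit port.
endpoint_host_port() {
  local url="$1" scheme rest hostport host port
  scheme=${url%%://*}
  rest=${url#*://}
  hostport=${rest%%/*}   # drop any path such as /announce
  host=${hostport%%:*}
  if [ "$hostport" != "$host" ]; then
    port=${hostport##*:}
  else
    case $scheme in
      https) port=443 ;;
      http)  port=80 ;;
      *)     return 1 ;;
    esac
  fi
  printf '%s %s\n' "$host" "$port"
}
```

The output can be passed to a port check such as `nc -zv`, keeping in mind that the `udp://` endpoints need a UDP probe rather than a TCP one.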

## Cost

> Will be documented after choosing server type.

| Resource | Monthly Cost (EUR) |
| -------- | ------------------ |
| Server | TBD |
| Total | TBD |