Commit 6a377ea

docs(issue-29): record phase 2 T+1h observation
No CPU improvement after ~1h36m. CPU2 still 100% softirq; Caddy ~321%. HTTP/3 (UDP 443) is ruled out as root cause. Refs: #29
1 parent 046eb11 commit 6a377ea

2 files changed

Lines changed: 43 additions & 6 deletions

docs/issues/ISSUE-29-research-high-cpu-load-after-udp-fix.md

Lines changed: 4 additions & 2 deletions
@@ -97,8 +97,10 @@ step.
 
 - [x] Remove `"443:443/udp"` from the Caddy service in `server/opt/torrust/docker-compose.yml`.
 - [x] Apply only that change on the live server and restart only Caddy.
-- [ ] Observe CPU, request rates, and external service health at T+1 h (≈ 2026-05-04 16:31 UTC)
-  and again the following day (2026-05-05).
+- [x] Observe CPU, request rates, and external service health at T+1 h (≈ 2026-05-04 16:31 UTC).
+  **Result: no improvement. CPU2 still 100% softirq; Caddy ~321%; load ~8.5. HTTP/3 is not
+  the cause.** See `01-phase2-disable-http3-execution.md` T+1 h section.
+- [ ] Observe the following day (2026-05-05) to confirm no delayed effect.
 - [ ] Decide whether Caddy CPU dropped materially enough to keep HTTP/3 disabled.
 
 Execution and immediate post-change checks are recorded in

docs/issues/evidence/ISSUE-29/01-phase2-disable-http3-execution.md

Lines changed: 39 additions & 4 deletions
@@ -110,13 +110,48 @@ From `https://newtrackon.com/raw` during this window:
 
 Agreed observation windows for Phase 2:
 
-| Checkpoint | Target time (UTC)        | Status  |
-| ---------- | ------------------------ | ------- |
-| T+1 h      | 2026-05-04 16:31         | pending |
-| T+next day | 2026-05-05 (any morning) | pending |
+| Checkpoint | Target time (UTC)        | Status   |
+| ---------- | ------------------------ | -------- |
+| T+1 h      | 2026-05-04 16:31         | complete |
+| T+next day | 2026-05-05 (any morning) | pending  |
 
 Capture the same metrics at each checkpoint: `mpstat`, `docker stats`, Prometheus
 HTTP1/UDP1 rates, and a `newtrackon.com/raw` sample.
 
+## T+1 h Observation (2026-05-04T16:54:13Z)
+
+Capture timestamp (UTC): `2026-05-04T16:54:13Z` (~1 h 36 min after change).
+
+- Host load average: `8.52 / 8.25 / 8.03`
+- `mpstat` all CPUs: `%usr=34.11`, `%sys=15.43`, `%soft=19.58`, `%idle=30.61`
+- `mpstat` CPU2: `%soft=100.00`, `%idle=0.00`, **unchanged from pre-change**
+- Container CPU snapshot:
+  - `caddy`: `321.33%`
+  - `tracker`: `95.47%`
+  - `mysql`: `7.06%`
+  - `grafana`: `0.32%`
+  - `prometheus`: `0.00%`
+- `ps` top processes: `caddy 301%`, `torrust-tracker 88.9%`, `ksoftirqd/2 15.0%`
+- Prometheus rates:
+  - HTTP1 request rate: `1834.0 req/s`
+  - UDP1 request rate: `2440.0 req/s`
+
+### External probe sample (newtrackon.com/raw)
+
+- `https://http1.torrust-tracker-demo.com:443/announce` -> `Working`
+- `udp://udp1.torrust-tracker-demo.com:6969/announce` -> `Working`
+
+### Assessment
+
+**No improvement observed.** CPU2 remains 100% softirq (`ksoftirqd/2` still
+pinned). Load, Caddy CPU (~320%), and tracker CPU (~95%) are all within the same
+range as before the change. Removing the Caddy UDP 443 port had no measurable
+effect on the softirq saturation, ruling out HTTP/3 (QUIC) as the root cause.
+
+The Phase 2 change is safe to keep (it was correct hygiene — we have no HTTP/3
+listener anyway), but it did not solve the CPU problem. The investigation must
+continue with Phase 3 (RPS/RFS CPU affinity) or a deeper look at why Caddy
+alone is consuming ~300% CPU at the observed request rate.
+
 Keep this single change in place until both checkpoints are completed before
 deciding whether to keep HTTP/3 disabled permanently or revert.
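
The assessment's closing question, why Caddy alone consumes ~300% CPU at the observed request rate, can be put in per-request terms. The following is a minimal sketch using only the T+1 h figures recorded in this commit. It assumes that Caddy fronts only the HTTP1 traffic and that the tracker serves both HTTP1 and UDP1 requests; neither assumption is stated in the commit. The helper name `cpu_ms_per_request` is hypothetical.

```python
# Figures copied from the T+1 h observation in this commit.
caddy_cpu_percent = 321.33    # docker stats value; 100% corresponds to one core
tracker_cpu_percent = 95.47
http1_req_per_s = 1834.0
udp1_req_per_s = 2440.0


def cpu_ms_per_request(cpu_percent: float, req_per_s: float) -> float:
    """Convert a CPU percentage and a request rate into CPU milliseconds per request."""
    cores = cpu_percent / 100.0          # CPU-seconds consumed per wall-clock second
    return cores / req_per_s * 1000.0    # CPU-milliseconds per request


# Assumption: Caddy terminates TLS and proxies HTTP1 traffic only.
caddy_ms = cpu_ms_per_request(caddy_cpu_percent, http1_req_per_s)

# Assumption: the tracker handles both the HTTP1 and the UDP1 request streams.
tracker_ms = cpu_ms_per_request(tracker_cpu_percent, http1_req_per_s + udp1_req_per_s)

print(f"caddy:   {caddy_ms:.2f} ms CPU per request")    # caddy:   1.75 ms CPU per request
print(f"tracker: {tracker_ms:.2f} ms CPU per request")  # tracker: 0.22 ms CPU per request
```

Under those assumptions Caddy spends roughly eight times as much CPU per request as the tracker, which is one way to quantify the "deeper look" the assessment calls for.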

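
Capturing the same container metrics at each checkpoint is easier to repeat if the `docker stats` snapshot is parsed into numbers. The sketch below assumes the snapshot was taken with `docker stats --no-stream --format "{{.Name}} {{.CPUPerc}}"`; the commit does not record the exact command, so that invocation and the function name `parse_docker_cpu` are illustrative. The sample text reuses the values from the T+1 h snapshot.

```python
def parse_docker_cpu(text: str) -> dict[str, float]:
    """Parse lines of the form '<container name> <cpu>%' into {name: cpu_percent}."""
    usage = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, cpu = line.rsplit(None, 1)      # split on the last run of whitespace
        usage[name] = float(cpu.rstrip("%"))  # "321.33%" -> 321.33
    return usage


# Values taken from the T+1 h container CPU snapshot in this commit.
SAMPLE = """\
caddy 321.33%
tracker 95.47%
mysql 7.06%
grafana 0.32%
prometheus 0.00%
"""

usage = parse_docker_cpu(SAMPLE)

# Containers using more than one full core at capture time.
over_one_core = [name for name, cpu in usage.items() if cpu > 100.0]
print(over_one_core)  # ['caddy']
```

Storing the parsed dictionary per checkpoint would allow the T+1 h and next-day captures to be compared value by value rather than by eye.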