Overview
After the conntrack fix and server resize in #21, UDP uptime recovered to
99.9%. A follow-on softirq hotspot (CPU2 at 100% soft) was diagnosed and fixed
in #29 Phase 3 (RPS/RFS enabled on 2026-05-04). The hotspot was resolved
immediately (soft-IRQ load spread across all 8 CPUs), but the host load average
remained in the 9–10 range on an 8-vCPU machine.
This means the server is running at or beyond its comfortable capacity even with
packet steering working correctly. If traffic continues to grow, or if the
softirq tuning is ever disrupted, uptime on newTrackon is likely to degrade
again. This issue tracks planning and execution of the next server scale-up.
Current State (2026-05-05)
Post-RPS/RFS baseline collected at 2026-05-05T09:13:52Z (T+1h after Phase 3):
- Host load average: 9.24 / 9.24 / 9.43 (8-vCPU machine)
- CPU distribution (mpstat -P ALL 1 1):
  - CPU2: %soft=49.48 (was 100% before Phase 3)
  - All CPUs: %soft in the 26–41% range
  - Combined idle across all CPUs: low, indicating sustained saturation
- Docker container CPU (approximate):
  - caddy: ~300%
  - tracker: ~90%
- Request rates from Prometheus (5-minute rate):
  - HTTP1: ~1982 req/s
  - UDP1: ~2124 req/s
  - Combined: ~4107 req/s (~513 req/s per vCPU)
- newTrackon status: both endpoints Working
Load averages of 9–10 on an 8-vCPU host indicate the runqueue is consistently
overcommitted even after the softirq fix. There is very little headroom.
Goal
Determine the right time and target plan to resize the server so that:
- Host load average stays comfortably below the vCPU count (target: < 0.7 per
vCPU, i.e., < 5.6 on an 8-vCPU host or < 11.2 on a 16-vCPU host).
- Combined req/s per vCPU drops to a level that provides meaningful headroom.
- newTrackon uptime for both endpoints remains >= 99.0%.
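The load-average target above is simple arithmetic, sketched here so it can be reused for other plan sizes. The function names are illustrative and not part of any existing tooling; only the 0.7-per-vCPU target comes from this issue.

```python
TARGET_LOAD_PER_VCPU = 0.7  # target from the Goal section of this issue


def load_ceiling(vcpus: int, per_vcpu: float = TARGET_LOAD_PER_VCPU) -> float:
    """Highest load average that still meets the per-vCPU target."""
    return round(per_vcpu * vcpus, 2)


def has_headroom(load_avg: float, vcpus: int) -> bool:
    """True if the observed load average is below the target ceiling."""
    return load_avg < load_ceiling(vcpus)


if __name__ == "__main__":
    for vcpus in (8, 16):
        print(f"{vcpus} vCPU: load ceiling {load_ceiling(vcpus)}")
    # Baseline 1-minute load average recorded on 2026-05-05.
    print("9.24 on 8 vCPU within target:", has_headroom(9.24, 8))
```

At the recorded baseline of 9.24, the 8-vCPU host is above its 5.6 ceiling, while the same load would sit below the 11.2 ceiling of a 16-vCPU host, assuming load does not change with the resize.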
Trigger Conditions
Do not resize until at least one of the following is true:
- T+next-day observation in ISSUE-29 shows the RPS/RFS fix is not holding
(CPU2 %soft returns to 100% or overall soft-IRQ pressure re-concentrates).
- newTrackon UDP uptime drops below 99.0% on the rolling 7-day window.
- newTrackon HTTP uptime drops below 99.0%.
- newTrackon response times for either endpoint increase materially (> 2×
current baseline) for more than 24 hours.
- Load average stays above 12 (1.5× the vCPU count) for a sustained 24-hour period.
Track these signals in the Observation Log section below.
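As a rough sketch, the trigger list can be expressed as a single check over one observation. The `Observation` type and its field names are hypothetical and would have to be filled in by hand from newTrackon and host metrics. Treating "sustained over a 24-hour period" as "at least 24 continuous hours above 12" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One observation interval. Field names are illustrative, not from the issue."""
    rps_fix_holding: bool       # False if CPU2 %soft is back at 100% or soft-IRQ re-concentrates
    udp_uptime_pct: float       # newTrackon UDP uptime, rolling 7-day window
    http_uptime_pct: float      # newTrackon HTTP uptime
    slow_response_hours: float  # hours either endpoint has been > 2x baseline response time
    high_load_hours: float      # continuous hours the load average has stayed above 12


def fired_triggers(obs: Observation) -> list[str]:
    """Return the names of all trigger conditions that are currently true."""
    checks = {
        "rps_rfs_fix_not_holding": not obs.rps_fix_holding,
        "udp_uptime_below_99": obs.udp_uptime_pct < 99.0,
        "http_uptime_below_99": obs.http_uptime_pct < 99.0,
        "response_time_over_2x_for_24h": obs.slow_response_hours > 24,
        "load_above_12_for_24h": obs.high_load_hours >= 24,  # assumed reading of "sustained"
    }
    return [name for name, fired in checks.items() if fired]


def should_resize(obs: Observation) -> bool:
    """At least one trigger must fire before a resize is considered."""
    return bool(fired_triggers(obs))


if __name__ == "__main__":
    # Example values only; uptime figures here are placeholders, not measurements.
    example = Observation(True, 99.9, 99.9, 0.0, 0.0)
    print(fired_triggers(example), should_resize(example))
```

Returning the names of the fired triggers, rather than a bare boolean, makes it easier to copy the reason into the Notes column of the Observation Log.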
newTrackon Tracking
Monitor both endpoints:
- HTTP: https://http1.torrust-tracker-demo.com:443/announce
- UDP: udp://udp1.torrust-tracker-demo.com:6969/announce
Check and record at each observation interval:
- Status (Working / Down)
- Rolling uptime %
- Response time (ms)
Observation Log
| Date (UTC) | HTTP1 status | HTTP1 uptime % | HTTP1 resp (ms) | UDP1 status | UDP1 uptime % | UDP1 resp (ms) | Load avg (1m) | Notes |
|---|---|---|---|---|---|---|---|---|
| 2026-05-05 | Working | n/a | n/a | Working | n/a | n/a | 9.24 | Baseline after RPS/RFS (ISSUE-29) |
Options Research
All prices are list prices in EUR (excl. VAT) as of May 2026.
Hetzner Cloud — AMD Dedicated vCPU (CCX Series)
These are cloud VMs with dedicated AMD vCPUs, easy to resize online via the
Hetzner console (no migration required, brief reboot only).
| Plan | vCPU | RAM | NVMe SSD | Traffic | Price/month |
|---|---|---|---|---|---|
| CCX13 | 2 | 8 GB | 80 GB | 20 TB | €16.49 |
| CCX23 | 4 | 16 GB | 160 GB | 20 TB | €31.99 |
| CCX33 | 8 | 32 GB | 240 GB | 30 TB | €62.99 |
| CCX43 | 16 | 64 GB | 360 GB | 40 TB | €125.49 |
| CCX53 | 32 | 128 GB | 600 GB | 40 TB | €250.49 |
| CCX63 | 48 | 192 GB | 960 GB | 60 TB | €374.99 |
Current plan: CCX33 (8 vCPU / 32 GB / €62.99/mo)
Next step up: CCX43 (16 vCPU / 64 GB / €125.49/mo — +€62.50/mo)
CCX43 would reduce normalized load from ~513 req/s/vCPU to ~257 req/s/vCPU at
current traffic, and load average headroom would double.
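The per-vCPU figures can be reproduced with a short projection. This is a sketch that assumes traffic stays at the 2026-05-05 baseline and that work spreads evenly across vCPUs, which real workloads only approximate. The vCPU counts are taken from the plan table; the helper name is illustrative.

```python
COMBINED_RPS = 4107  # baseline combined HTTP1 + UDP1 request rate (req/s)

# vCPU counts from the CCX plan table above.
PLAN_VCPUS = {"CCX33": 8, "CCX43": 16, "CCX53": 32}


def rps_per_vcpu(total_rps: float, vcpus: int) -> float:
    """Normalized request rate, assuming an even spread across vCPUs."""
    return total_rps / vcpus


if __name__ == "__main__":
    for plan, vcpus in PLAN_VCPUS.items():
        print(f"{plan}: ~{round(rps_per_vcpu(COMBINED_RPS, vcpus))} req/s per vCPU")
```

The same helper can be rerun with a newer Prometheus rate to see how quickly growth erodes the margin a larger plan would provide.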
Advantages of cloud step-up:
- No setup fee; no data migration.
- Revert is possible if the resize is not justified.
- Consistent experience with previous resize (CCX23 → CCX33).
Hetzner Dedicated Servers
Dedicated physical servers provide more cores and threads per EUR, but require
a manual server migration (data copy, DNS/IP cutover) and a one-time setup fee.
| Model | Cores | Threads | RAM | Storage | Bandwidth | Price/month | Setup fee |
|---|---|---|---|---|---|---|---|
| EX44 | 14 | 20 | 64 GB | 2 × 512 GB NVMe | 1000 Mbit | ~€44 | ~€109 |
| AX42-U | 8 | 16 | 64 GB | 2 × 512 GB NVMe | 1000 Mbit | ~€54 | ~€234 |
| EX63 | 20 | 20 | 64 GB | 2 × 1 TB NVMe | 1000 Mbit | ~€76 | ~€325 |
| AX102-U | 16 | 32 | 128 GB | varies | 1000 Mbit | ~€119 | ~€500 |
EX44 is the standout option if we decide to go dedicated:
- 14 physical cores / 20 threads vs 8 vCPUs today.
- 64 GB RAM (2× current).
- Monthly cost (~€44) is lower than the current CCX33 (€62.99), a saving of
  ~€19/mo.
- Against the current CCX33, the one-time setup fee of ~€109 is recovered after
  roughly 6 months of savings (~€109 / ~€19 per month).
- Break-even vs CCX43 (€125.49/mo): in month 1 total spend is ~€153 vs ~€125;
  from month 2 onwards EX44 saves ~€82/mo over CCX43, so the setup fee is
  recovered within 2 months.
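The break-even arithmetic can be checked with a small cumulative-cost comparison. This is a sketch using the approximate list prices quoted above. It ignores migration effort and any period during which both servers run in parallel, and the function names are illustrative.

```python
from typing import Optional


def cumulative_cost(monthly: float, setup: float, months: int) -> float:
    """Total spend after a number of months, including the one-time setup fee."""
    return setup + monthly * months


def break_even_month(monthly_a: float, setup_a: float,
                     monthly_b: float, setup_b: float = 0.0,
                     max_months: int = 120) -> Optional[int]:
    """First month in which option A's cumulative cost is no higher than B's."""
    for month in range(1, max_months + 1):
        if cumulative_cost(monthly_a, setup_a, month) <= cumulative_cost(monthly_b, setup_b, month):
            return month
    return None  # A never catches up within max_months


if __name__ == "__main__":
    ex44 = (44.0, 109.0)  # approximate monthly price and setup fee
    print("EX44 vs CCX43:", break_even_month(*ex44, 125.49))
    print("EX44 vs CCX33:", break_even_month(*ex44, 62.99))
```

With these figures the setup fee is recovered in month 2 against CCX43 and in month 6 against the current CCX33.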
Disadvantages of dedicated:
- Manual migration required (bring-your-own IP, data copy, DNS update).
- No online resize; rollback is much harder.
- Bare-metal; OS and boot configuration is our responsibility.
- Physical hardware failure handling differs from cloud VMs.
Decision Matrix
| Criterion | CCX43 (cloud step-up) | EX44 (dedicated) |
|---|---|---|
| Monthly cost | €125.49 | ~€44 (saves ~€19/mo vs current) |
| Setup friction | Minimal (reboot only) | High (full migration) |
| Reversibility | Easy | Hard |
| CPU headroom at ~4k rps | 16 vCPU / ~257 rps/vCPU | 20 threads / ~205 rps/thread |
| RAM headroom | 64 GB | 64 GB |
| Long-term cost | More expensive | Cheaper after break-even (~2 months) |
| Risk | Low | Medium (migration complexity) |
Recommendation: Start with CCX43 if the trigger is near-term and urgency
is high. Plan migration to EX44 if sustained long-term cost reduction is the
priority once the situation is stable.
Acceptance Criteria
Refs: #29, #21