
docs: [#310] research database backup strategies #312

Merged
josecelano merged 31 commits into main from
310-research-database-backup-strategies
Jan 30, 2026

@josecelano (Member) commented Jan 28, 2026

Summary

Comprehensive research documentation for database backup strategies as part of Epic #309 (Add backup support).

This PR includes complete research for SQLite and MySQL backup strategies, backup tools evaluation, container backup architectures, a working proof-of-concept backup container with 58 bats-core unit tests, and a recommended solution (Maintenance Window Hybrid approach).

What's Included

Database Backup Strategies

SQLite

  • Backup approaches: .backup command (Online Backup API), VACUUM INTO, file copy risks
  • WAL mode analysis: Checkpointing behavior, persistence, pros/cons
  • Backup verification and restore procedures: Integrity checks, recovery steps
  • Torrust Live Demo analysis: Current implementation (unsafe cp), proposed improvements
  • ⚠️ Critical Large Database Finding: SQLite .backup stalled at 10% after 16+ hours on a 17 GB database under load (~37 MB/hour effective rate). The maintenance window backup, run with the tracker stopped, completed in 72 seconds.
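
The SQLite approaches listed above can be sketched as small shell helpers. This is a minimal sketch, not the POC's backup.sh: the function names and paths are illustrative assumptions, and it assumes the `sqlite3` CLI is installed (3.27.0 or newer for `VACUUM INTO`).

```shell
# Sketch only: function names and paths are hypothetical, not taken from the PR.

# Online Backup API through the sqlite3 CLI (.backup dot-command).
# Paths containing a single quote are not handled by this sketch.
sqlite_online_backup() {
    sqlite3 "$1" ".backup '$2'"
}

# VACUUM INTO writes a compacted copy; the target file must not already exist.
sqlite_vacuum_into() {
    sqlite3 "$1" "VACUUM INTO '$2';"
}

# Integrity check on the copy: PRAGMA integrity_check prints "ok" when clean.
sqlite_verify() {
    [ "$(sqlite3 "$1" 'PRAGMA integrity_check;')" = "ok" ]
}
```

A typical call would be `sqlite_online_backup /data/tracker.db /backups/tracker.db && sqlite_verify /backups/tracker.db`, where both paths are placeholders.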

MySQL

  • Backup approaches: mysqldump, physical backups, binary log backups
  • Container-specific considerations: Accessing MySQL in Docker containers
  • Backup verification and restore procedures
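
A minimal sketch of a containerized mysqldump run, for illustration only. It assumes the official MySQL image (which exposes `MYSQL_ROOT_PASSWORD` inside the container); the function names and the file naming scheme are hypothetical and may differ from the POC.

```shell
# Sketch only: names and the file naming scheme are hypothetical.

# Build the dump file path: <dir>/<database>_<timestamp>.sql.gz
mysql_backup_path() {
    printf '%s/%s_%s.sql.gz\n' "$1" "$2" "$3"
}

# Dump one database from a running container and compress the stream.
# --single-transaction gives a consistent InnoDB snapshot without locking tables.
# Note: the pipeline's exit status is gzip's, so a failed dump still needs
# a separate check on the output file.
mysql_dump_container() {
    container=$1
    database=$2
    out_file=$3
    docker exec "$container" sh -c \
        'exec mysqldump --single-transaction --routines --triggers -uroot -p"$MYSQL_ROOT_PASSWORD" "$1"' \
        sh "$database" | gzip > "$out_file"
}
```

The path helper is kept separate from the dump so it can be unit-tested without Docker.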

Container Backup Architectures

  • 5 patterns documented: Host Crontab, Centralized, Sidecar, Orchestrator, External Tool
  • Comparison matrix with pros/cons
  • Decision flowchart for pattern selection

Backup Tools Evaluation

  • Restic: Recommended - mature, encrypted, deduplicated, Docker support
  • ⚠️ Kopia: Alternative - newer, more features (GUI, ECC, server mode), less mature
  • Rustic: Discarded - beta status, not production-ready
  • Two-phase backup approach documented (DB dump → file backup)
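
The two-phase approach (DB dump, then file backup) could look roughly like this. It is a sketch under assumptions: `dump_databases` is a hypothetical placeholder for the dump step, and restic is expected to read its repository and password from `RESTIC_REPOSITORY` and `RESTIC_PASSWORD_FILE`.

```shell
# Sketch only: dump_databases is a hypothetical placeholder for phase 1.
dump_databases() {
    echo "dumping databases into $1"
}

# Phase 1 writes consistent dumps into a staging directory;
# phase 2 snapshots that directory with restic.
two_phase_backup() {
    staging_dir=$1
    tag=$2
    dump_databases "$staging_dir" || return 1
    restic backup "$staging_dir" --tag "$tag"
}

# Retention and repository verification after the snapshot.
restic_prune_and_check() {
    keep_daily=$1
    restic forget --keep-daily "$keep_daily" --prune && restic check
}
```

Staging the dump first means restic only ever sees files that are no longer being written.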

Solution Comparison (NEW)

Four backup solutions evaluated with detailed trade-off analysis:

| Solution | Best For | Complexity |
|---|---|---|
| Continuous Sidecar | Hot backups, simple setup | Low |
| Maintenance Window | Large DBs, complete consistency | Medium |
| External Scheduler | Multi-service environments | High |
| Native Database | WAL-enabled SQLite | Low |

Recommended Solution: Maintenance Window Hybrid (95% container, 5% host script)

Maintenance Window Backup POC (Complete - NEW)

A working proof-of-concept with 58 bats-core unit tests supporting both MySQL and SQLite:

| Feature | Status |
|---|---|
| MySQL backup with mysqldump | ✅ Complete |
| SQLite backup with sqlite3 | ✅ Complete |
| Config file backup | ✅ Complete |
| Retention policy (delete old backups) | ✅ Complete |
| Single mode (run once, exit) | ✅ Complete |
| Continuous mode (loop) | ✅ Complete |
| Host orchestration script | ✅ Complete |
| Crontab configuration | ✅ Complete |
| 58 unit tests | ✅ All passing |
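
The single and continuous modes can be illustrated with a small dispatcher. This is a sketch, not the POC code: `run_backup_cycle` is a hypothetical placeholder, while the `BACKUP_MODE` and `BACKUP_INTERVAL` variables and their defaults (single, 86400 seconds) follow the commit notes in this PR.

```shell
# Sketch only: run_backup_cycle stands in for the real backup work.
run_backup_cycle() {
    echo "backup cycle"
}

run_by_mode() {
    mode=${BACKUP_MODE:-single}
    interval=${BACKUP_INTERVAL:-86400}
    case "$mode" in
        single)
            # Run once and return, so the container exits when done.
            run_backup_cycle
            ;;
        continuous)
            # Sequential loop: the next cycle starts only after the previous
            # one finished and the interval elapsed, so cycles never overlap.
            while true; do
                run_backup_cycle
                sleep "$interval"
            done
            ;;
        *)
            echo "ERROR: unknown BACKUP_MODE: $mode" >&2
            return 1
            ;;
    esac
}
```

In single mode the container exiting is the expected result, which is what makes it a fit for a host-side scheduler.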

POC Artifacts:

  • Multi-stage Dockerfile with MySQL and SQLite support
  • backup.sh script with modular functions
  • maintenance-backup.sh host orchestration script
  • Docker Compose examples for MySQL and SQLite
  • Production and test crontab configurations
  • Lessons learned document with implementation concerns
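
For orientation, the host-side part of the maintenance window flow can be sketched as follows. The service names `tracker` and `backup`, the compose directory, and the cron paths are assumptions for illustration; the real maintenance-backup.sh in the artifacts folder may differ.

```shell
# Sketch only: service names and paths are hypothetical.
# The function body runs in a subshell so the caller's directory is untouched.
maintenance_backup() (
    cd "$1" || exit 1
    docker compose stop tracker || exit 1
    status=0
    # One-shot backup container (single mode): runs once, then exits.
    docker compose run --rm backup || status=$?
    # Restart the tracker even when the backup failed.
    docker compose start tracker || exit 1
    exit "$status"
)

# Example crontab entry for a daily 3 AM run (paths are placeholders):
# 0 3 * * * /opt/torrust/maintenance-backup.sh >> /var/log/maintenance-backup.log 2>&1
```

Capturing the backup's exit status before restarting the service keeps downtime bounded while still reporting the failure to cron.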

Key Findings

| Finding | Details |
|---|---|
| SQLite Safe Backup | Use .backup command (Online Backup API) - safe during concurrent writes |
| SQLite Large DB Limitation | .backup impractical for DBs > 1GB due to locking overhead (~37 MB/hour) |
| Maintenance Window Backup | 72 seconds for 17GB SQLite (vs ~17 days with .backup) |
| Disk I/O Capacity | 445 MB/s proven - SQLite locking is bottleneck, not disk |
| MySQL Backup | mysqldump works reliably for containerized deployments |
| WAL Mode | Optional for safe backups, useful for read performance under high load |
| Recommended Tool | Restic - battle-tested, simple, Docker-native, sufficient features |
| Recommended Solution | Maintenance Window Hybrid - container + host crontab |
| Sidecar Pattern | Best for single-server deployments with few services |

Lessons Learned (Implementation Concerns)

Key pain points discovered during POC that affect future implementation:

| Pain Point | Severity | Notes |
|---|---|---|
| Template conditionals for DB type | Medium | Docker Compose env vars differ for MySQL vs SQLite |
| Path translation (host/container) | Medium | Multiple representations of same path |
| SSH agent key selection | Low | Use IdentitiesOnly=yes |
| Container exits in single mode | Low | Expected behavior, just surprising |
| Log rotation missing | Low | Easy to add, often forgotten |
| Backup verification missing | Medium | Important for production |
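
Since backup verification is listed as missing, here is one possible minimal check for compressed dumps. It is an illustrative sketch (the function name is hypothetical) and only proves the archive is non-empty and readable, not that the dump restores correctly.

```shell
# Sketch only: a basic sanity check for a gzip-compressed dump.
verify_gzip_backup() {
    file=$1
    if [ ! -s "$file" ]; then
        echo "ERROR: backup missing or empty: $file" >&2
        return 1
    fi
    # gzip -t tests archive integrity without extracting it.
    if ! gzip -t "$file" 2>/dev/null; then
        echo "ERROR: corrupt gzip archive: $file" >&2
        return 1
    fi
}
```

A full verification would still need a test restore into a scratch database.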

Related Issues

Checklist

Research Complete

  • SQLite backup approaches documented
  • SQLite large database findings (17GB test)
  • MySQL backup approaches documented
  • WAL mode analysis with checkpointing behavior
  • Backup verification and restore procedures
  • Torrust Live Demo analysis
  • Container backup architectures (5 patterns)
  • Backup tools evaluation (Restic, Kopia, Rustic)
  • Solution comparison (4 approaches)
  • Recommended solution documented

POC Complete

  • Multi-stage Dockerfile with MySQL and SQLite support
  • 58 bats-core unit tests (all passing)
  • MySQL backup/restore validated
  • SQLite backup/restore validated
  • Config file backup
  • Retention policy (delete expired backups)
  • Single mode (run once, exit)
  • Continuous mode (loop with interval)
  • Host orchestration script
  • Crontab configurations (production + test)
  • Docker Compose examples (MySQL + SQLite)
  • Lessons learned document
  • Issue spec progress updated (all tasks complete)

Future Work (out of scope for this PR)

  • Implement backup command in deployer
  • Off-site transfer automation (S3, Backblaze B2)
  • Backup encryption
  • Backup verification command

Documentation Structure

docs/research/backup-strategies/
├── README.md                           # Overview and navigation
├── conclusions.md                      # Key findings and recommendations
├── requirements.md                     # Design preferences
├── architectures/
│   └── container-patterns.md           # 5 architecture patterns
├── databases/
│   ├── mysql/
│   │   ├── README.md
│   │   └── backup-approaches.md
│   └── sqlite/
│       ├── README.md
│       ├── backup-approaches.md
│       ├── large-database-backup.md    # Critical 17GB findings
│       └── torrust-live-demo/
│           ├── README.md
│           ├── current-implementation.md
│           └── proposed-improvements.md
├── tools/
│   ├── README.md                       # Tools overview
│   ├── restic.md                       # Detailed Restic evaluation
│   └── restic-vs-kopia.md              # Comparison document
└── solutions/
    ├── README.md                       # Solution comparison (NEW)
    ├── sidecar-container/              # Original sidecar POC
    └── maintenance-window/             # Recommended solution (NEW)
        ├── README.md                   # Architecture and workflow
        ├── implementation-recommendations.md  # Lessons learned
        └── artifacts/
            ├── backup-container/
            │   ├── Dockerfile
            │   ├── backup.sh
            │   └── backup_test.bats    # 58 tests
            ├── docker-compose-with-backup-mysql.yml
            ├── docker-compose-with-backup-sqlite.yml
            ├── maintenance-backup.sh
            ├── maintenance-backup.cron
            └── maintenance-backup-test.cron

@josecelano self-assigned this Jan 28, 2026
Research documentation covering:

SQLite backup strategies:
- Backup approaches (.backup command, VACUUM INTO, file copy risks)
- WAL mode analysis with checkpointing behavior
- Backup verification and restore procedures
- Torrust Live Demo analysis (current unsafe cp, proposed .backup)

Container backup architectures:
- 5 patterns documented (Host Crontab, Centralized, Sidecar, Orchestrator, External Tool)
- Comparison matrix with pros/cons
- Decision flowchart for pattern selection

Backup tools evaluation:
- Restic: Recommended - mature, encrypted, deduplicated, Docker support
- Kopia: Alternative - newer, more features (GUI, ECC, server mode)
- Rustic: Discarded - beta status, not production-ready
- Two-phase backup approach (DB dump → file backup)

Key findings:
- Use .backup command for SQLite (Online Backup API, safe during writes)
- WAL mode optional for safe backups (useful for read performance)
- Restic is best fit: battle-tested, simple, Docker-native, sufficient features

Related issues created on torrust-demo:
- Issue #85: Use .backup instead of cp
- Issue #86: Evaluate WAL mode for high-traffic scenario
@josecelano force-pushed the 310-research-database-backup-strategies branch from 773c4f9 to 848dbde January 28, 2026 10:40
Key conclusions:
- SQLite: Use .backup command (Online Backup API), WAL mode optional
- Tool: Restic recommended (mature, encrypted, Docker-native)
- Scope: Document best practices but don't automate in deployer yet

Rationale for not automating:
- Backup strategies are opinionated and vary by user preference
- Cloud providers offer native backup/snapshot tools
- Some users prefer infrastructure-level over application-level backups
- Adding backup automation increases configuration complexity

Recommended approach:
- Document best practices (done)
- Implement manually in Torrust Live Demo
- Provide templates/examples for users who want to implement

- Add MySQL backup approaches documentation (mysqldump, Percona XtraBackup)
- Document InnoDB lock-free backup with --single-transaction
- Add sidecar container backup solution as recommended pattern
- Document files to backup with host-to-container path mapping
- Add Restic best practices section (staging pattern, tags, verification)
- Update issue spec to mark SQLite research goals complete
- Add technical terms to project dictionary
- Add proof-of-concept implementation plan for sidecar container
- Document performance/scalability considerations (17GB database)
- Answer all open questions with decisions
- Add backup execution flexibility requirements
- Environment manual-test-sidecar-backup created and running
- Verified all 4 MySQL tables use InnoDB engine
- Tracker API accessible and responding
- Documented instance details and validation results
- Unified backup.sh script handles MySQL and config backups
- Configuration-driven via environment variables (no rebuild needed)
- backup-paths.txt file for flexible path specification
- Standardized storage structure: etc/ lib/ log/
- Backs up: .env, docker-compose.yml, tracker/etc, prometheus/etc, grafana/provisioning
- Removed separate backup-mysql.sh and entrypoint.sh scripts
- Add production-considerations.md documenting security, performance,
  reliability, and operational issues to address for production use
- Update Dockerfile to run as torrust user (uid=1000) instead of root
- Matches host app user for correct backup file ownership
- Rename from 'Archive Creation' to 'Backup Maintenance'
- Two-phase approach: raw backup then compress/cleanup
- Add compression for config files older than 1 hour
- Add retention policy with BACKUP_RETENTION_DAYS env var
- Document no-overlap behavior of sequential loop
- Explain when restic would be needed vs simple bash
…ntion)

- Add run_maintenance() function after each backup cycle
- Implement compress_old_config_backups() - package configs older than 1 hour
- Implement apply_retention_policy() - delete backups older than N days
- Add BACKUP_RETENTION_DAYS env var (default: 7) to docker-compose
- Update script and Dockerfile headers with new env var documentation
- MySQL dumps compressed immediately during backup for efficiency
- Config files packaged later in maintenance phase for storage efficiency
- Add comprehensive function documentation with Arguments, Returns, Side Effects
- Document all 25+ functions with consistent style
- Add explanatory comments for complex logic (packaging rationale, streaming)
- Fix counting bugs: replace grep -c with wc -l for reliable integer results
- Simplify delete_old_files_from using find -delete
- Script is now ~46% documentation (264/570 lines)
- Add 44 unit tests covering all helper functions
- Test naming follows project convention: it_should_{behavior}_when_{condition}
- Tests run during Docker build - build fails if tests fail
- Multi-stage Dockerfile: test stage creates marker file, production stage requires it
- Make constants configurable for test isolation (BACKUP_DIR_MYSQL, etc.)
- Fix is_comment_or_empty to handle whitespace-only lines
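
The counting and retention changes above might look like the sketch below. The body of `delete_old_files_from` is a guess at the shape of the real function, and `count_files` is a hypothetical helper. One likely reason for moving away from `grep -c` is that it exits non-zero when the count is 0, so a fallback such as `|| echo 0` yields two lines instead of one integer.

```shell
# Sketch only: count_files is hypothetical; the real helpers may differ.

# Count regular files under a directory. wc -l always prints exactly one
# integer; tr strips the padding some wc implementations add.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Delete files whose age in whole days is greater than $2.
# -delete is supported by GNU and BSD find but is not in POSIX.
delete_old_files_from() {
    dir=$1
    days=$2
    [ -d "$dir" ] || return 0
    find "$dir" -type f -mtime +"$days" -delete
}
```

With `BACKUP_RETENTION_DAYS=7`, a call could be `delete_old_files_from /backups/mysql "$BACKUP_RETENTION_DAYS"`, where the directory is a placeholder.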

Tested functions:
- Text processing: is_comment_or_empty, trim_whitespace
- Configuration: get_interval, get_retention_days, get_paths_file, is_mysql_enabled
- File system: ensure_directory_exists, get_file_size, has_valid_paths_file
- MySQL: generate_mysql_backup_path, validate_mysql_configuration
- Maintenance: cleanup_empty_directories, delete_old_files_from
- Logging: log, log_header, log_item, log_error
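
As an illustration of the text-processing helpers named above, a possible implementation is sketched here. The function names come from the commit message, but the bodies are assumptions and may not match backup.sh.

```shell
# Sketch only: bodies are guesses; only the names come from the PR.

# Strip leading and trailing whitespace (spaces and tabs).
trim_whitespace() {
    printf '%s\n' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# True for empty lines, whitespace-only lines, and lines starting with '#'.
is_comment_or_empty() {
    line=$(trim_whitespace "$1")
    case "$line" in
        '' | '#'*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Trimming before the check is what makes whitespace-only lines in a paths file count as empty.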
Tested and documented all restore procedures:

- MySQL restore to test database (validation)
- MySQL restore to production database
- Config file restore
- Full disaster recovery simulation

Key findings:
- RTO ~15 seconds for small databases
- All 4 tables restored correctly
- Tracker healthy after restore
- Hidden files (.env) need explicit copy

Documented issues:
- cp -r dir/* doesn't copy hidden files
- MySQL 'keys' is a reserved word (but backup handles this)
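
The hidden-file issue can be avoided by copying `dir/.` instead of `dir/*`, since the shell's `*` glob does not match names that start with a dot. A minimal sketch with a hypothetical helper name:

```shell
# Sketch only: copy a directory's contents, hidden files such as .env included.
copy_dir_contents() {
    src=$1
    dest=$2
    mkdir -p "$dest" && cp -R "$src"/. "$dest"/
}
```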
Real-world testing on 17GB Torrust Demo production database:

- SQLite .backup command is unusable for large databases under load
  - Ran 16+ hours, stalled at 10% (1.7GB of 17GB)
  - Effective rate: ~37 MB/hour vs disk capable of 445 MB/s
  - Never completed due to constant restart-on-modification

- Maintenance window approach tested and verified
  - 72 seconds for complete 17GB backup (with tracker stopped)
  - ~90 seconds total downtime including stop/start
  - Off-site transfer: 9 minutes at 32.3 MB/s

- Added size-based scalability recommendations
  - <1GB: use .backup (works well)
  - 1-10GB: consider maintenance window
  - >10GB: must use alternatives (LVM/ZFS snapshots, Litestream)

- Documents alternative approaches for large databases
  - Filesystem snapshots (instant, no downtime)
  - VACUUM INTO (compacted copy)
  - Litestream (continuous replication)
  - WAL mode with checkpoint control
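
The size-based recommendations above can be expressed as a small decision helper. This is only a sketch: the function name and the returned labels are hypothetical, and the thresholds simply mirror the <1GB / 1-10GB / >10GB bands from this commit.

```shell
# Sketch only: labels are illustrative. Size is given in MiB to stay well
# within shell integer limits.
recommend_sqlite_strategy() {
    size_mib=$1
    if [ "$size_mib" -lt 1024 ]; then
        echo "online-backup"            # .backup works well below 1 GB
    elif [ "$size_mib" -le 10240 ]; then
        echo "maintenance-window"       # 1-10 GB: consider stopping the service
    else
        echo "snapshot-or-replication"  # >10 GB: LVM/ZFS snapshots, Litestream
    fi
}
```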
- Complete Phase 7 (Documentation Update) with lessons learned
- Update preliminary conclusions with critical large database warning
- Mark POC as complete (all 7 phases done)

Key findings documented:
- Sidecar container pattern only practical for databases < 1GB
- SQLite .backup stalls for large databases under concurrent load
- Maintenance window backup (72s for 17GB) is the practical alternative
- 44 unit tests validate backup script behavior

Reorganize the backup-strategies research documentation:

- Move database-specific docs to databases/ (mysql/, sqlite/)
- Move container-backup-architectures.md to architectures/container-patterns.md
- Rename preliminary-conclusions.md to conclusions.md
- Rename requirements-notes.md to requirements.md
- Move POC files to solutions/sidecar-container/
- Move sidecar-container.md to solutions/sidecar-container/design.md
- Delete redundant proof-of-concept.md

New structure:
- databases/mysql/ - MySQL backup approaches
- databases/sqlite/ - SQLite backup approaches and large DB findings
- architectures/ - Container backup patterns
- tools/ - Backup tool evaluations (restic, etc.)
- solutions/sidecar-container/ - Complete POC with phases and artifacts

All internal links updated to reflect new paths.

Add two proposed solutions for handling large database backups:

- exclude-statistics: Backup only essential data, exclude stats tables
- maintenance-window: Host-level backup with service stop/restart

These alternatives address the finding that sidecar container backup
is only practical for databases < 1GB.

Analyzed 17GB production database:
- torrents table: 161M rows (~8 GB, 99.8% of DB)
- 96.9% of torrents have completed=0 (never downloaded)
- Excluding these reduces backup to ~247 MB (98.5% reduction)

Key limitation documented: This reduces backup SIZE but NOT
backup TIME under heavy load due to SQLite locking contention.
…ault backup interval

- Add maintenance-window artifacts folder with:
  - maintenance-backup.sh: host-level orchestration script
  - maintenance-backup.cron: crontab entry for daily 3 AM backup
  - backup-container/: Dockerfile and backup script with BACKUP_MODE support
  - docker-compose files and environment config
- Update default BACKUP_INTERVAL from 120s to 86400s (24 hours)
- Replace inline script in README with artifacts folder reference
- Both sidecar and maintenance-window solutions share the same backup script
- Add backup_test.bats for maintenance-window backup script (48 tests)
- Update sidecar-container tests with BACKUP_MODE tests
- Update default interval tests from 120s to 86400s (24 hours)
- Add shellcheck directives for bats-specific warnings
- Add clarifying comments about /data mount point in docker-compose files
- Document why we mount entire deployment directory (root .env + storage/)
- Add 'subshells' to project dictionary
- Update conclusions.md with recommended solution section
- Compare maintenance-window vs sidecar approaches
- Update solutions/README.md to recommend maintenance-window
- Update main README.md with new recommendation
- Update sidecar-container README.md with limitation warning
- Fix reference paths in conclusions.md

The maintenance-window hybrid approach is now the recommended solution:
- 95%+ of logic in portable container
- Works for databases of any size (17GB in ~90s vs 16+ hours)
- Simple crontab + ~50 lines of host script
- Could be automated by deployer in Configure phase
- Add SQLite backup functionality alongside MySQL support
- Add BACKUP_SQLITE_ENABLED and SQLITE_DATABASE_PATH environment variables
- Add backup_sqlite(), generate_sqlite_backup_path(), dump_sqlite_database() functions
- Add is_sqlite_enabled(), get_sqlite_database_path(), validate_sqlite_configuration()
- Add sqlite3 package to Dockerfile for database backup
- Change default BACKUP_MODE from continuous to single
- Change default BACKUP_INTERVAL from 120s to 86400s (24 hours)
- Make logging consistent: both MySQL and SQLite show database details conditionally
- Add 10 new SQLite unit tests (58 total tests)
- Rename docker-compose-with-backup.yml to docker-compose-with-backup-mysql.yml
- Add docker-compose-with-backup-sqlite.yml for SQLite deployments
- Add maintenance-backup-test.cron for testing with 2-minute interval
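
A rough sketch of how the SQLite configuration helpers might behave. The function and variable names come from this commit message, but the bodies, the accepted value `true`, and the backup file naming are assumptions.

```shell
# Sketch only: bodies and the file naming scheme are guesses.

is_sqlite_enabled() {
    [ "${BACKUP_SQLITE_ENABLED:-false}" = "true" ]
}

# Succeeds when SQLite backup is disabled, or enabled with a readable path.
validate_sqlite_configuration() {
    is_sqlite_enabled || return 0
    if [ -z "${SQLITE_DATABASE_PATH:-}" ]; then
        echo "ERROR: SQLITE_DATABASE_PATH is not set" >&2
        return 1
    fi
    if [ ! -f "$SQLITE_DATABASE_PATH" ]; then
        echo "ERROR: database not found: $SQLITE_DATABASE_PATH" >&2
        return 1
    fi
}

# Build the backup file path: <dir>/sqlite_<timestamp>.db
generate_sqlite_backup_path() {
    printf '%s/sqlite_%s.db\n' "$1" "$2"
}
```

Validating up front lets the container fail fast instead of producing an empty backup.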
Captures practical concerns and edge cases discovered during research:
- Template complexity for MySQL vs SQLite env vars
- Path translation between host/container contexts
- SSH agent key selection issues
- Backup container single mode behavior
- Configuration validation discoveries
- Crontab and log rotation considerations
- Open questions for future implementation
@josecelano marked this pull request as ready for review January 30, 2026 12:51
@josecelano
Member Author

ACK 1168780

@josecelano merged commit a09067c into main Jan 30, 2026
34 checks passed

Development

Successfully merging this pull request may close these issues.

Research database backup strategies

1 participant