
chore: increase timeout for rabbitmq probe#9030

Merged
rjsparks merged 1 commit into ietf-tools:main from jennifer-richards:waiting-for-rabbits
Jun 20, 2025

Conversation

@jennifer-richards (Member) commented Jun 19, 2025

Liveness probes for the RabbitMQ pod occasionally fail in production, sometimes leading to the pod being terminated and replaced. Other than interruptions caused by the roll-over, there are no signs of problems with the service. Notably, the celery worker processes jobs without apparent interruption, which indicates that the message queue is operating. RabbitMQ itself reports no errors, and its memory and CPU usage are unremarkable. There are some indications that the k8s node may be busy at the time of the mq pod restart (synthetic checks had slow responses around the same time).

My suspicion is that once in a while, perhaps under heavy load, the rabbitmq-diagnostics ping command we use takes too long to execute, exceeding the short (5s) timeout on the liveness probe. This PR sets a 30-second timeout on the ping command, which previously used its default infinite timeout, and raises the k8s livenessProbe timeout to 35 seconds to allow time for the command to start and exit.
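The change described above might look roughly like the following probe configuration. This is a hedged sketch, not the actual manifest from the ietf-tools repo; field values other than the two timeouts are illustrative, and the real deployment may structure the probe differently.

```yaml
# Sketch of a livenessProbe matching the description in this PR.
# rabbitmq-diagnostics accepts a --timeout option (seconds); the k8s
# timeoutSeconds is set slightly higher so the probe does not fire
# before the command itself can time out and exit.
livenessProbe:
  exec:
    command:
      - rabbitmq-diagnostics
      - ping
      - --timeout
      - "30"            # was previously the default (no timeout)
  timeoutSeconds: 35    # > command timeout, allows start/exit overhead
```

The key design point is that the inner command timeout (30s) is strictly less than the outer probe timeout (35s), so a slow ping fails cleanly inside the container rather than being killed mid-flight by kubelet.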

@rjsparks rjsparks merged commit e93a56b into ietf-tools:main Jun 20, 2025
2 checks passed
@jennifer-richards jennifer-richards deleted the waiting-for-rabbits branch June 20, 2025 19:47
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jun 24, 2025


3 participants