Watchdog Timers

Using hardware watchdogs to workaround hardware/software lockups.

Talos Linux now supports hardware watchdog timers configuration. Hardware watchdog timers allow to reset (reboot) the system if the software stack becomes unresponsive. Please consult your hardware/VM documentation for the availability of the hardware watchdog timers.

Configuration

To discover the available watchdog devices, run:

$ talosctl ls /sys/class/watchdog/
NODE         NAME
172.20.0.2   .
172.20.0.2   watchdog0
172.20.0.2   watchdog1

The implementation of the watchdog device can be queried with:

$ talosctl read /sys/class/watchdog/watchdog0/identity
i6300ESB timer

To enable the watchdog timer, patch the machine configuration with the following:

# watchdog.yaml
apiVersion: v1alpha1
kind: WatchdogTimerConfig
device: /dev/watchdog0
timeout: 5m
talosctl patch mc -p @watchdog.yaml

Talos Linux will set up the watchdog time with a 5-minute timeout, and it will keep resetting the timer to prevent the system from rebooting. If the software becomes unresponsive, the watchdog timer will expire, and the system will be reset by the watchdog hardware.

Inspection

To inspect the watchdog timer configuration, run:

$ talosctl get watchdogtimerconfig
NODE         NAMESPACE   TYPE                  ID      VERSION   DEVICE           TIMEOUT
172.20.0.2   runtime     WatchdogTimerConfig   timer   1         /dev/watchdog0   5m0s

To inspect the watchdog timer status, run:

$ talosctl get watchdogtimerstatus
NODE         NAMESPACE   TYPE                  ID      VERSION   DEVICE           TIMEOUT
172.20.0.2   runtime     WatchdogTimerStatus   timer   1         /dev/watchdog0   5m0s

Current status of the watchdog timer can also be inspected via Linux sysfs:

$ talosctl read /sys/class/watchdog/watchdog0/state
active