Watchdog timer in Linux (Armbian/Ubuntu/Debian)

Introduction

A watchdog timer (https://en.wikipedia.org/wiki/%D0%A1%D1%82%D0%BE%D1%80%D0%BE%D0%B6%D0%B5%D0%B2%D0%BE%D0%B9_%D1%82%D0%B0%D0%B9%D0%BC%D0%B5%D1%80) is a hardware mechanism that detects system hangs and restarts the system. Watchdog timers are used in systems that must operate without human supervision. Such systems must recover on their own, without operator intervention.

The JetHub D1 and JetHub H1 controllers support a hardware watchdog through the meson_wdt driver.

Check devices

When the watchdog driver is loaded correctly, the /dev/watchdog and /dev/watchdog0 devices should be visible in /dev:

root@jethubj100:~# ls -la /dev/watchdog*
crw------- 1 root root 10, 130 Feb 21 17:29 /dev/watchdog
crw------- 1 root root 246, 0 Feb 21 17:29 /dev/watchdog0
root@jethubj100:~#
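
The same check can be scripted, for example as part of a provisioning step. The following is a minimal sketch, not part of the original procedure: the function name check_watchdog_devices is hypothetical, and it only tests that each path is a character device, not that the driver behind it works.

```shell
#!/bin/sh
# check_watchdog_devices [PATH...]   (hypothetical helper name)
# Prints one line per path and returns non-zero if any path
# is not a character device.
check_watchdog_devices() {
    # Default to the two device nodes named in this article.
    [ "$#" -gt 0 ] || set -- /dev/watchdog /dev/watchdog0
    missing=0
    for dev in "$@"; do
        if [ -c "$dev" ]; then
            echo "ok: $dev"
        else
            echo "missing: $dev"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_watchdog_devices without arguments checks /dev/watchdog and /dev/watchdog0. A non-zero return code means at least one of them is absent.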

Installing the Watchdog service

To install the watchdog service, run the following commands:

sudo apt-get update
sudo apt-get install watchdog

Firmware images starting from Armbian 22.02 have the watchdog package pre-installed.

Create a directory for the watchdog log files:

sudo mkdir -p /var/log/watchdog

Check the settings for the service in /etc/default/watchdog:

# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="none"
# Specify additional watchdog options here (see manpage).
watchdog_options="-s -v -c /etc/watchdog.conf"

Configuration files

Make the necessary changes to the configuration file /etc/watchdog.conf:

* uncomment the use of the /dev/watchdog device (otherwise the watchdog service will not use the hardware timer to restart the controller)
* set the necessary checks and timeouts

The following is an example configuration file in which the hardware timer restarts a hung system after 15 seconds:

$ cat /etc/watchdog.conf
#ping = 172.31.14.1
#ping = 172.26.1.255
#interface=eth0
#file = /var/log/messages
#change = 1407

# Uncomment to enable test. Setting one of these values to '0' disables it.
# These values will hopefully never reboot your machine during normal use
# (if your machine is really hung, the loadavg will go much higher than 25)
#max-load-1 = 24
#max-load-5 = 18
#max-load-15 = 12

# Note that this is the number of pages!
# To get the real size, check how large the pagesize is on your machine.
#min-memory = 1
#allocatable-memory = 1

#repair-binary = /usr/sbin/repair
#repair-timeout = 60
#test-binary=
#test-timeout = 60

# The retry-timeout and repair limit are used to handle errors in a more robust
# style. Errors must persist for longer than retry-timeout to action a repair
# or reboot, and if repair-maximum attempts are made without the test passing a
# reboot is initiated anyway.
#retry-timeout = 60
#repair-maximum = 1

watchdog-device=/dev/watchdog

# Defaults compiled into the binary
#temperature-sensor =
#max-temperature = 90

# Defaults compiled into the binary
#admin = root
#interval = 1
#logstick = 1
#log-dir = /var/log/watchdog

# This greatly decreases the chance that watchdog won't be scheduled before
#your machine is really loaded
realtime = yes
priority=1

# Check if rsyslogd is still running by enabling the following line
#pidfile = /var/run/rsyslogd.pid

watchdog-timeout = 15

The watchdog-timeout value sets how many seconds after the watchdog service stops resetting the hardware timer the timer will restart the controller.
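
Because the service has to reset the timer more often than the timer expires, the polling interval must stay below watchdog-timeout. The sketch below reads both values from a configuration file and compares them. The helper names conf_get and check_timeout are hypothetical, the parser handles only simple "key = value" lines, and the fallback of 1 second for interval is taken from the "#interval = 1" default shown in the example file above.

```shell
#!/bin/sh
# conf_get FILE KEY   (hypothetical helper)
# Prints the value of an uncommented "KEY = value" line.
# If the key appears more than once, the last one wins.
conf_get() {
    sed -n "/^[[:space:]]*$2[[:space:]]*=/{s/^[^=]*=[[:space:]]*//;s/[[:space:]]*\$//;p;}" "$1" | tail -n 1
}

# check_timeout FILE   (hypothetical helper)
# Returns 0 if interval < watchdog-timeout, 1 if not,
# 2 if watchdog-timeout is not set in the file.
check_timeout() {
    timeout=$(conf_get "$1" watchdog-timeout)
    interval=$(conf_get "$1" interval)
    # Assumption: fall back to the default listed in the example file.
    [ -n "$interval" ] || interval=1
    if [ -z "$timeout" ]; then
        echo "watchdog-timeout is not set in $1"
        return 2
    fi
    if [ "$interval" -lt "$timeout" ]; then
        echo "ok: interval=${interval}s timeout=${timeout}s"
    else
        echo "warning: interval=${interval}s is not shorter than timeout=${timeout}s"
        return 1
    fi
}
```

For the example file above, check_timeout /etc/watchdog.conf would compare an interval of 1 second against a timeout of 15 seconds.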

Autostart and service check

To start the service automatically at boot and to start it now, run the following commands:

sudo systemctl enable watchdog.service
sudo systemctl start watchdog.service

Check that the service is running:

root@jethubj100:~# service watchdog status
● watchdog.service - watchdog daemon
Loaded: loaded (/lib/systemd/system/watchdog.service; enabled; vendor pres>
Active: active (running) since Mon 2022-02-21 17:29:24 UTC; 17 hours ago
Process: 2718 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${w>
Process: 2720 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin>
Main PID: 2722 (watchdog)
Tasks: 1 (limit: 977)
Memory: 516.0K
CPU: 3min 33.528s
CGroup: /system.slice/watchdog.service
└─2722 /usr/sbin/watchdog -s -v -c /etc/watchdog.conf

Feb 22 10:52:23 jethubj100 watchdog[2722]: still alive after 62076 interval(s)
Feb 22 10:52:24 jethubj100 watchdog[2722]: still alive after 62077 interval(s)
Feb 22 10:52:25 jethubj100 watchdog[2722]: still alive after 62078 interval(s)
Feb 22 10:52:26 jethubj100 watchdog[2722]: still alive after 62079 interval(s)
Feb 22 10:52:27 jethubj100 watchdog[2722]: still alive after 62080 interval(s)
Feb 22 10:52:28 jethubj100 watchdog[2722]: still alive after 62081 interval(s)
Feb 22 10:52:29 jethubj100 watchdog[2722]: still alive after 62082 interval(s)

A configured and running watchdog service periodically resets the hardware watchdog. If it fails to do so (for example, if the system freezes or any other condition set in the configuration file occurs), the timer expires and restarts the controller.
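
The reset mechanism itself is simple and can be illustrated without the watchdog service. In the standard Linux watchdog interface, opening the device arms the timer, each write resets it, and writing the character V just before closing asks the driver to disarm it ("magic close"). The sketch below is an illustration only, under several assumptions: the function name pet_watchdog is hypothetical, the device can be opened by only one process at a time (so no watchdog or wd_keepalive process may be holding it), and whether the driver on a given controller honors magic close is not stated here. If it does not, the controller will restart after the function returns.

```shell
#!/bin/sh
# pet_watchdog DEVICE COUNT PAUSE   (hypothetical helper)
# Opens DEVICE, resets the timer COUNT times with PAUSE seconds
# between resets, then requests disarming and closes the device.
pet_watchdog() {
    dev=$1 count=$2 pause=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            # Any write to the open device resets the hardware timer.
            printf '.' >&3
            i=$((i + 1))
            sleep "$pause"
        done
        # Magic close: ask the driver to stop the timer before closing.
        printf 'V' >&3
    } 3>"$dev"   # fd 3 is opened here and closed when the block ends
}
```

For example, pet_watchdog /dev/watchdog 10 5 would keep the timer from expiring for roughly 50 seconds, provided the pause stays shorter than the hardware timeout.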

Watchdog test

Warning: the command in this section triggers a kernel panic, after which the controller stops working completely until it is restarted. Use it only in a test environment.

echo c > /proc/sysrq-trigger

This command deliberately crashes the Linux kernel. If the watchdog is working correctly, it reboots the system automatically once the timeout expires.

root@jethubj100:~# echo c > /proc/sysrq-trigger
[63168.053150] sysrq: Trigger a crash
[63168.053204] Kernel panic - not syncing: sysrq triggered crash
[63168.056648] CPU: 3 PID: 65544 Comm: bash Not tainted 5.15.24-meson64 #trunk.0045.jethome.0
[63168.064838] Hardware name: JetHome JetHub J100 (DT)
[63168.069670] Call trace:
[63168.072082] dump_backtrace+0x0/0x200
[63168.075706] show_stack+0x18/0x68
[63168.078982] dump_stack_lvl+0x68/0x84
[63168.082605] dump_stack+0x18/0x34
[63168.085882] panic+0x164/0x324
[63168.088901] sysrq_handle_crash+0x1c/0x20
[63168.092869] __handle_sysrq+0x8c/0x160
[63168.096577] write_sysrq_trigger+0x88/0x120
[63168.100719] proc_reg_write+0xac/0xf8
[63168.104340] vfs_write+0xbc/0x398
[63168.107618] ksys_write+0x68/0xf0
[63168.110895] __arm64_sys_write+0x1c/0x28
[63168.114776] invoke_syscall+0x44/0x108
[63168.118485] el0_svc_common.constprop.3+0x94/0xf8
[63168.123143] do_el0_svc+0x24/0x88
[63168.126420] el0_svc+0x20/0x50
[63168.129439] el0t_64_sync_handler+0x90/0xb8
[63168.133579] el0t_64_sync+0x180/0x184
[63168.137206] SMP: stopping secondary CPUs
[63168.141091] Kernel Offset: disabled
[63168.144533] CPU features: 0x00001001,00000846
[63168.148845] Memory Limit: none
[63168.151871] ---[ end Kernel panic - not syncing: sysrq triggered crash ]---

After a pause of 15 seconds (the value of watchdog-timeout in the configuration file), the controller restarts and the boot log appears:

AXG:BL1:d1dbf2:a4926f;FEAT:E0DC318C:2000;POC:F;EMMC:0;READ:0;0.0;CHK:0;
sdio debug board detected
TE: 33151

BL2 Built : 10:43:22, May 26 2021. axg g28b9431 - jenkins@walle02-sh
set vcck to 1100mv
set vddee to 950mv
Board ID = 9
CPU clk: 1200MHz
DDR low power enabled
DDR3 chl: Rank0 16bit @ 912MHz
bist_test rank: 0 1b 02 34 24 0b 3d 17 00 2f 27 0e 40 00 00 00 00 00 00 00 0000 00 00 00 761 - PASS
Rank0: 1024MB(auto)-2T-13
AddrBus test pass!
eMMC boot@0
sw8s
storage init finish
emmc switch 3 ok
Authentication key not yet programmed
get rpmb counter error 0x00000007
emmc switch 0 ok
Load FIP TMP HDR from eMMC, src: 0x0000c200, des: 0x05100000, size: 0x00004000, part: 0
0001c000Load BL31 from eMMC, src: 0x0001c200, des: 0x05104000, size: 0x0002ac00, part: 0
bl2z: ptr: 05127358, size: 00001e18
Load FIP HDR from eMMC, src: 0x0000c200, des: 0x01700000, size: 0x00004000, part: 0
Load BL3x from eMMC, src: 0x00010200, des: 0x01704000, size: 0x0008e400, part: 0
NOTICE: BL31: v1.3(release):110e239
NOTICE: BL31: Built : 19:07:23, Jul 2 2018
NOTICE: BL31: AXG normal boot!
NOTICE: BL31: BL33 decompress pass
[Image: axg_v1.1.3326-d0bacc8 2018-07-05 11:21:34 jenkins@walle02-sh]
OPS=0x43
25 0b 43 00 88 fc 1a 07 8d 24 0c 3a b2 65 16 59
bl30:axg ver: 9 mode: 0
bl30:axg thermal0
[0.015862 Inits done]
secure task start!
high task start!
low task start!
ERROR: Error initializing runtime service opteed_fast


U-Boot 2022.01-armbian (Feb 08 2022 - 06:07:00 +0000) jethubj100