Watchdog timer (watchdog) in Linux (Armbian/Ubuntu/Debian)

Introduction

A watchdog timer (watchdog) is a hardware mechanism for recovering from system hangs. Watchdog timers are used in systems that must operate without human supervision and recover from failures without operator involvement.

The JetHub D1 and JetHub H1 controllers have a hardware watchdog that is supported by the meson_wdt driver.

Checking devices

When the watchdog modules are loaded correctly, the /dev/watchdog and /dev/watchdog0 devices should be present in /dev:

root@jethubj100:~# ls -la /dev/watchdog*
crw------- 1 root root 10, 130 Feb 21 17:29 /dev/watchdog
crw------- 1 root root 246, 0 Feb 21 17:29 /dev/watchdog0
root@jethubj100:~#
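
The same check can be scripted, for example before enabling the service. The sketch below is only an illustration: the function name check_watchdog_devices is hypothetical, and it does nothing more than test whether each path passed to it is a character device.

```shell
#!/bin/sh
# Hypothetical helper (not part of the watchdog package): report whether
# each given path exists and is a character device.
# Returns 0 only if every path passed the check.
check_watchdog_devices() {
    rc=0
    for dev in "$@"; do
        if [ -c "$dev" ]; then
            echo "$dev: present"
        else
            echo "$dev: missing"
            rc=1
        fi
    done
    return "$rc"
}
```

On the controller it would be called as: check_watchdog_devices /dev/watchdog /dev/watchdog0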

Installing Watchdog Service

To install the watchdog service, run the following commands:

sudo apt-get update
sudo apt-get install watchdog

In firmware based on Armbian 22.02 and later, the watchdog service is preinstalled.

Create a folder for the watchdog log files:

sudo mkdir -p /var/log/watchdog

Check the settings for the service in the /etc/default/watchdog file:

# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="none"
# Specify additional watchdog options here (see manpage).
watchdog_options="-s -v -c /etc/watchdog.conf"
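
Because this file consists of plain shell-style variable assignments, its current values can be printed by sourcing it in a subshell. This is a minimal sketch: the function name show_watchdog_defaults is hypothetical, and it assumes the file is trusted, since sourcing executes its contents.

```shell
#!/bin/sh
# Hypothetical helper: print selected settings from a defaults file.
# The file is sourced inside a subshell so that its variables do not
# leak into the calling shell. Only use this on a trusted file.
show_watchdog_defaults() {
    (
        . "$1" || exit 1
        echo "run_watchdog=${run_watchdog:-unset}"
        echo "run_wd_keepalive=${run_wd_keepalive:-unset}"
        echo "watchdog_module=${watchdog_module:-unset}"
    )
}
```

On the controller it would be called as: show_watchdog_defaults /etc/default/watchdog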

Configuration files

Make the necessary changes to the configuration file /etc/watchdog.conf:

* uncomment the use of the /dev/watchdog device (otherwise the watchdog service will not use the hardware timer to restart the controller)
* set the necessary checks and timeouts
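
The first change can also be made non-interactively. The following is a sketch under stated assumptions: the function name enable_watchdog_device is hypothetical, the in-place option -i assumes GNU sed (as shipped with Debian-based systems), and the file is expected to contain a commented line of the form "#watchdog-device = ...".

```shell
#!/bin/sh
# Hypothetical helper: uncomment the watchdog-device line in a config file.
# Relies on GNU sed's in-place editing (-i); other lines are left untouched.
enable_watchdog_device() {
    sed -i 's/^#[[:space:]]*\(watchdog-device[[:space:]]*=\)/\1/' "$1"
}
```

On the controller it would be run with root privileges as: enable_watchdog_device /etc/watchdog.conf. Making a backup copy of the file first is a sensible precaution.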

Below is an example configuration file with the hang timeout set to 15 seconds:

$ cat /etc/watchdog.conf
#ping = 172.31.14.1
#ping = 172.26.1.255
#interface=eth0
#file = /var/log/messages
#change = 1407
 
# Uncomment to enable test. Setting one of these values to '0' disables it.
# These values will hopefully never reboot your machine during normal use
# (if your machine is really hung, the loadavg will go much higher than 25)
#max-load-1 = 24
#max-load-5 = 18
#max-load-15 = 12
 
# Note that this is the number of pages!
# To get the real size, check how large the pagesize is on your machine.
#min-memory = 1
#allocatable-memory = 1
 
#repair-binary = /usr/sbin/repair
#repair-timeout = 60
#test-binary=
#test-timeout = 60
 
# The retry-timeout and repair limit are used to handle errors in a more robust
# style. Errors must persist for longer than retry-timeout to action a repair
# or reboot, and if repair-maximum attempts are made without the test passing a
# reboot is initiated anyway.
#retry-timeout = 60
#repair-maximum = 1
 
watchdog-device=/dev/watchdog
 
# Defaults compiled into the binary
#temperature-sensor =
#max-temperature = 90
 
# Defaults compiled into the binary
#admin = root
#interval = 1
#logstick = 1
#log-dir = /var/log/watchdog
 
# This greatly decreases the chance that watchdog won't be scheduled before
#your machine is really loaded
realtime = yes
priority=1
 
# Check if rsyslogd is still running by enabling the following line
#pidfile = /var/run/rsyslogd.pid
 
watchdog-timeout = 15

The watchdog-timeout value sets how long, in seconds, the hardware timer waits after the watchdog service stops resetting it before it resets the controller.
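
To confirm which timeout is actually in effect, the active (uncommented) value can be read back from the file. A minimal sketch, assuming simple "key = value" lines as in the example above; the function name get_watchdog_setting is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of an uncommented "key = value" line.
# Usage: get_watchdog_setting FILE KEY
# Commented lines do not match, because their name starts with "#".
get_watchdog_setting() {
    awk -F= -v key="$2" '
        {
            name = $1
            gsub(/[ \t]/, "", name)          # drop spaces around the key
            if (name == key) {
                value = $2
                gsub(/^[ \t]+|[ \t]+$/, "", value)  # trim the value
                print value
            }
        }
    ' "$1"
}
```

On the controller it would be called as: get_watchdog_setting /etc/watchdog.conf watchdog-timeout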

Autostart and service check

To enable autostart of the service, run the following commands:

sudo systemctl enable watchdog.service
sudo systemctl start watchdog.service

Service health check:

root@jethubj100:~# service watchdog status
● watchdog.service - watchdog daemon
     Loaded: loaded (/lib/systemd/system/watchdog.service; enabled; vendor pres>
     Active: active (running) since Mon 2022-02-21 17:29:24 UTC; 17 hours ago
    Process: 2718 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${w>
    Process: 2720 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin>
   Main PID: 2722 (watchdog)
      Tasks: 1 (limit: 977)
     Memory: 516.0K
        CPU: 3min 33.528s
     CGroup: /system.slice/watchdog.service
             └─2722 /usr/sbin/watchdog -s -v -c /etc/watchdog.conf

Feb 22 10:52:23 jethubj100 watchdog[2722]: still alive after 62076 interval(s)
Feb 22 10:52:24 jethubj100 watchdog[2722]: still alive after 62077 interval(s)
Feb 22 10:52:25 jethubj100 watchdog[2722]: still alive after 62078 interval(s)
Feb 22 10:52:26 jethubj100 watchdog[2722]: still alive after 62079 interval(s)
Feb 22 10:52:27 jethubj100 watchdog[2722]: still alive after 62080 interval(s)
Feb 22 10:52:28 jethubj100 watchdog[2722]: still alive after 62081 interval(s)
Feb 22 10:52:29 jethubj100 watchdog[2722]: still alive after 62082 interval(s)
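
The "still alive after N interval(s)" messages in this output give a simple liveness indicator: the counter should keep growing while the service runs. As an illustration only, the hypothetical function last_alive_interval below extracts the most recent counter from log lines supplied on standard input.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print the interval count
# from the last "still alive after N interval(s)" message.
# Prints nothing if no such message is found.
last_alive_interval() {
    sed -n 's/.*still alive after \([0-9][0-9]*\) interval(s).*/\1/p' | tail -n 1
}
```

It could be fed from the service journal, for example: journalctl -u watchdog.service | last_alive_interval. Running it twice a few seconds apart should print a larger number the second time.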

A configured and running watchdog service constantly resets the hardware watchdog timer. If it cannot do this (for example, if the system freezes or any other condition configured in the configuration file occurs), the timer expires and restarts the controller.

Checking the watchdog timer

Be careful: the commands in this section cause a kernel panic and completely stop the controller. Use them only in a test environment.

echo c > /proc/sysrq-trigger

This command deliberately crashes the Linux kernel. If the watchdog is working correctly, it automatically reboots the system after the timeout.

root@jethubj100:~# echo c > /proc/sysrq-trigger
[63168.053150] sysrq: Trigger a crash
[63168.053204] Kernel panic - not syncing: sysrq triggered crash
[63168.056648] CPU: 3 PID: 65544 Comm: bash Not tainted 5.15.24-meson64 #trunk.0045.jethome.0
[63168.064838] Hardware name: JetHome JetHub J100 (DT)
[63168.069670] Call trace:
[63168.072082] dump_backtrace+0x0/0x200
[63168.075706] show_stack+0x18/0x68
[63168.078982] dump_stack_lvl+0x68/0x84
[63168.082605] dump_stack+0x18/0x34
[63168.085882] panic+0x164/0x324
[63168.088901] sysrq_handle_crash+0x1c/0x20
[63168.092869] __handle_sysrq+0x8c/0x160
[63168.096577] write_sysrq_trigger+0x88/0x120
[63168.100719] proc_reg_write+0xac/0xf8
[63168.104340] vfs_write+0xbc/0x398
[63168.107618] ksys_write+0x68/0xf0
[63168.110895] __arm64_sys_write+0x1c/0x28
[63168.114776] invoke_syscall+0x44/0x108
[63168.118485] el0_svc_common.constprop.3+0x94/0xf8
[63168.123143] do_el0_svc+0x24/0x88
[63168.126420] el0_svc+0x20/0x50
[63168.129439] el0t_64_sync_handler+0x90/0xb8
[63168.133579] el0t_64_sync+0x180/0x184
[63168.137206] SMP: stopping secondary CPUs
[63168.141091] Kernel Offset: disabled
[63168.144533] CPU features: 0x00001001,00000846
[63168.148845] Memory Limit: none
[63168.151871] ---[ end Kernel panic - not syncing: sysrq triggered crash ]---

After a pause of 15 seconds (the value of watchdog-timeout in the configuration file), the controller reboots:

AXG:BL1:d1dbf2:a4926f;FEAT:E0DC318C:2000;POC:F;EMMC:0;READ:0;0.0;CHK:0;
sdio debug board detected
TE: 33151

BL2 Built : 10:43:22, May 26 2021. axg g28b9431 - jenkins@walle02-sh
set vcck to 1100mv
set vddee to 950mv
Board ID = 9
CPU clk: 1200MHz
DDR low power enabled
DDR3 chl: Rank0 16bit @ 912MHz
bist_test rank: 0 1b 02 34 24 0b 3d 17 00 2f 27 0e 40 00 00 00 00 00 00 00 0000 00 00 00 761 - PASS
Rank0: 1024MB(auto)-2T-13
AddrBus test pass!
eMMC boot @ 0
sw8 s
storage init finish
emmc switch 3 ok
Authentication key not yet programmed
get rpmb counter error 0x00000007
emmc switch 0 ok
Load FIP TMP HDR from eMMC, src: 0x0000c200, des: 0x05100000, size: 0x00004000, part: 0
0001c000
BL31 from eMMC, src: 0x0001c200, des: 0x05104000, size: 0x0002ac00, part: 0
bl2z: ptr: 05127358, size: 00001e18
BL3x from eMMC, src: 0x00010200, des: 0x01704000, size: 0x0008e400, part: 0
NOTICE: BL31: v1.3(release):110e239
NOTICE: BL31: Built : 19:07:23, Jul 2 2018
AXG normal boot!
NOTICE: BL31: BL33 decompress pass
bl30:axg ver: 9 mode: 0
bl30:axg thermal0
[0.015862 Inits done]
secure task start!
high task start!
low task start!
ERROR: Error initializing runtime service opteed_fast

U-Boot 2022.01-armbian (Feb 08 2022 - 06:07:00 +0000) jethubj100
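
If no serial console is attached, the reset can also be confirmed after logging back in: a small uptime value shows that the system rebooted recently. The sketch below is an assumption-laden illustration; the function name uptime_seconds is hypothetical, and it reads the first field of /proc/uptime (seconds since boot) unless another file is given.

```shell
#!/bin/sh
# Hypothetical helper: print the uptime in whole seconds.
# Reads a /proc/uptime-style file, whose first field is the number of
# seconds since boot; defaults to /proc/uptime when no argument is given.
uptime_seconds() {
    awk '{ print int($1) }' "${1:-/proc/uptime}"
}
```

Shortly after a watchdog reset, uptime_seconds should print a value of at most a few minutes rather than the many hours the system had been running before the test.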