Watchdog in Linux

A watchdog is a hardware timer that restarts the system if it hangs.

Watchdog timers are used in systems that must operate without human supervision. Such systems must recover on their own, without operator intervention.

Note

The JetHome JetHub controllers based on Amlogic processors support the hardware watchdog driver meson_wdt.

Checking devices

When the watchdog driver is loaded and working, the devices /dev/watchdog and /dev/watchdog0 are present in /dev:

$ ls -l /dev/watchdog*
crw-rw---- 1 root root 10, 130 2019-01-01 00:00 /dev/watchdog
crw-rw---- 1 root root 10, 130 2019-01-01 00:00 /dev/watchdog0
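
The same check can be scripted, for example in a provisioning or health-check script. The sketch below is only an illustration: check_watchdog_devices is a hypothetical helper name, not part of the watchdog package. It tests each path with the standard `[ -c ... ]` character-device test.

```shell
# Report whether each given path exists as a character device.
# check_watchdog_devices is a hypothetical helper, not part of any package.
check_watchdog_devices() {
    status=0
    for dev in "$@"; do
        if [ -c "$dev" ]; then
            echo "$dev: present"
        else
            echo "$dev: missing"
            status=1
        fi
    done
    # Non-zero exit status if at least one device is missing.
    return "$status"
}
```

On the controller it would be called as `check_watchdog_devices /dev/watchdog /dev/watchdog0`.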

Installing the Watchdog service

Note

In firmware versions starting from Armbian 22.02, watchdog is pre-installed and manual installation is not required.

To install the watchdog service, run the following commands:

sudo apt-get update
sudo apt-get install watchdog

Create a directory for watchdog log files:

sudo mkdir -p /var/log/watchdog

Check the settings for the service in the file /etc/default/watchdog:

# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="none"
# Specify additional watchdog options here (see manpage).
watchdog_options="-s -v -c /etc/watchdog.conf"
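
Because the file shown above uses plain shell variable assignments, its effective values can be printed by sourcing it in a subshell. This is a minimal sketch under that assumption; show_watchdog_defaults is a hypothetical helper name.

```shell
# Print the main settings from a shell-style defaults file.
# show_watchdog_defaults is a hypothetical helper, not part of the package.
show_watchdog_defaults() {
    # The subshell keeps the sourced variables out of the calling shell.
    (
        . "$1" || exit 1
        printf 'run_watchdog=%s\n' "${run_watchdog:-unset}"
        printf 'watchdog_module=%s\n' "${watchdog_module:-unset}"
        printf 'watchdog_options=%s\n' "${watchdog_options:-unset}"
    )
}
```

On the controller it would be called as `show_watchdog_defaults /etc/default/watchdog`. Quotes around the values are removed by the shell, so watchdog_module="none" is printed as watchdog_module=none.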

Configuration files

Make the necessary changes to the configuration file /etc/watchdog.conf:

  • Uncomment the line watchdog-device = /dev/watchdog (otherwise the watchdog service will not use the hardware timer to reboot the controller).

  • Set the necessary checks and timeouts.

Below is an example configuration file with the hardware timer timeout set to 15 seconds:

Note

The watchdog-timeout value sets the period of the hardware timer: if the watchdog service stops resetting the timer, the controller is restarted once this many seconds have passed since the last reset.

$ cat /etc/watchdog.conf
#ping                   = 172.31.14.1
#ping                   = 172.26.1.255
#interface              = eth0
#file                   = /var/log/messages
#change                 = 1407

# Uncomment to enable test. Setting one of these values to '0' disables it.
# These values will hopefully never reboot your machine during normal use
# (if your machine is really hung, the loadavg will go much higher than 25)
#max-load-1             = 24
#max-load-5             = 18
#max-load-15            = 12

# Note that this is the number of pages!
# To get the real size, check how large the pagesize is on your machine.
#min-memory             = 1
#allocatable-memory     = 1

#repair-binary          = /usr/sbin/repair
#repair-timeout         = 60
#test-binary            =
#test-timeout           = 60

# The retry-timeout and repair limit are used to handle errors in a more robust
# manner. Errors must persist for longer than retry-timeout to action a repair
# or reboot, and if repair-maximum attempts are made without the test passing a
# reboot is initiated anyway.
#retry-timeout          = 60
#repair-maximum         = 1

watchdog-device = /dev/watchdog

# Defaults compiled into the binary
#temperature-sensor     =
#max-temperature        = 90

# Defaults compiled into the binary
#admin                  = root
#interval               = 1
#logtick                = 1
#log-dir                = /var/log/watchdog

# This greatly decreases the chance that watchdog won't be scheduled before
# your machine is really loaded
realtime                = yes
priority                = 1

# Check if rsyslogd is still running by enabling the following line
#pidfile                = /var/run/rsyslogd.pid

watchdog-timeout        = 15
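
Most lines in this file are commented out, so it is easy to misread which settings are active. The sketch below prints the value of one uncommented key. conf_get is a hypothetical helper name, and taking the last occurrence of a repeated key is an assumption of this sketch, not documented behaviour of the watchdog daemon.

```shell
# Print the value of an active (uncommented) "key = value" line.
# conf_get is a hypothetical helper; usage: conf_get FILE KEY
conf_get() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                 # skip commented-out lines
        {
            i = index($0, "=")
            if (i == 0) next                # not a key = value line
            k = substr($0, 1, i - 1)
            v = substr($0, i + 1)
            gsub(/[ \t]/, "", k)            # strip padding around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)  # trim the value
            if (k == key) val = v           # assumption: last occurrence wins
        }
        END { if (val != "") print val }
    ' "$1"
}
```

For the file above, `conf_get /etc/watchdog.conf watchdog-timeout` prints 15, while `conf_get /etc/watchdog.conf interval` prints nothing because that line is commented out.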

Autostart and service check

To start the service automatically at boot and launch it now, run the following commands:

sudo systemctl enable watchdog
sudo systemctl start watchdog

Check that the service is running:

service watchdog status

Example output:

 watchdog.service - watchdog daemon
     Loaded: loaded (/lib/systemd/system/watchdog.service; enabled; vendor pres>
     Active: active (running) since Mon 2022-02-21 17:29:24 UTC; 17h ago
    Process: 2718 ExecStartPre=/bin/sh -c [ -z "${watchdog_module}" ] || [ "${w>
    Process: 2720 ExecStart=/bin/sh -c [ $run_watchdog != 1 ] || exec /usr/sbin>
   Main PID: 2722 (watchdog)
      Tasks: 1 (limit: 977)
     Memory: 516.0K
        CPU: 3min 33.528s
     CGroup: /system.slice/watchdog.service
             └─2722 /usr/sbin/watchdog -s -v -c /etc/watchdog.conf

Feb 22 10:52:23 jethubj100 watchdog[2722]: still alive after 62076 interval(s)
Feb 22 10:52:24 jethubj100 watchdog[2722]: still alive after 62077 interval(s)
Feb 22 10:52:25 jethubj100 watchdog[2722]: still alive after 62078 interval(s)
Feb 22 10:52:26 jethubj100 watchdog[2722]: still alive after 62079 interval(s)
Feb 22 10:52:27 jethubj100 watchdog[2722]: still alive after 62080 interval(s)
Feb 22 10:52:28 jethubj100 watchdog[2722]: still alive after 62081 interval(s)
Feb 22 10:52:29 jethubj100 watchdog[2722]: still alive after 62082 interval(s)
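
The interval counter in these log lines can be read by a script, for example to confirm that it keeps growing between two checks. The sketch below extracts the most recent counter from log text on standard input; last_alive_count is a hypothetical helper name, and the pattern assumes the exact "still alive after N interval(s)" wording shown above.

```shell
# Print the counter from the last "still alive" line read on stdin.
# last_alive_count is a hypothetical helper, not part of the package.
last_alive_count() {
    sed -n 's/.*still alive after \([0-9][0-9]*\) interval(s).*/\1/p' | tail -n 1
}
```

On a systemd-based system it could be fed from the journal: `journalctl -u watchdog -n 20 --no-pager | last_alive_count`.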

Note

A configured and running watchdog service constantly resets the hardware watchdog timer.

If it fails to do so (for example, if the system hangs or any other configured condition occurs), the timer will trigger and restart the controller.
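
To illustrate the mechanism, the sketch below imitates what the daemon does: it keeps the device open and writes to it at regular intervals. It relies on the Linux kernel watchdog interface, where opening the device starts the timer, any write resets it, and writing the character V just before closing disarms the timer on drivers that support "magic close". feed_watchdog is a hypothetical helper name. This is not the daemon's actual implementation, and it should only be tried against an ordinary file or on a test system: the real device can normally be opened by one process at a time, and closing it without a working magic close leads to a reboot.

```shell
# Write a keepalive to DEVICE every INTERVAL seconds, COUNT times.
# feed_watchdog is a hypothetical helper; usage: feed_watchdog DEVICE COUNT INTERVAL
feed_watchdog() {
    dev=$1 count=$2 interval=$3
    {
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '.' >&3      # any write resets the hardware timer
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V' >&3          # magic close: ask the driver to disarm on close
    } 3>"$dev"                  # descriptor 3 stays open for the whole loop
}
```

A safe dry run against a scratch file is `feed_watchdog /tmp/fake-watchdog 3 1`, which leaves the three keepalive characters followed by V in that file.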

Watchdog timer check

Warning

Be careful: the commands in this section cause the kernel to panic and stop the controller completely.

Use them only in a test environment!

The following command, run as root, deliberately crashes the Linux kernel. If the watchdog works correctly, the hardware timer reboots the system after the timeout:

echo c > /proc/sysrq-trigger

Example output:

[63168.053150] sysrq: Trigger a crash
[63168.053204] Kernel panic - not syncing: sysrq triggered crash
[63168.056648] CPU: 3 PID: 65544 Comm: bash Not tainted 5.15.24-meson64 #trunk.0045.jethome.0
[63168.064838] Hardware name: JetHome JetHub J100 (DT)
[63168.069670] Call trace:
[63168.072082]  dump_backtrace+0x0/0x200
[63168.075706]  show_stack+0x18/0x68
[63168.078982]  dump_stack_lvl+0x68/0x84
[63168.082605]  dump_stack+0x18/0x34
[63168.085882]  panic+0x164/0x324
[63168.088901]  sysrq_handle_crash+0x1c/0x20
[63168.092869]  __handle_sysrq+0x8c/0x160
[63168.096577]  write_sysrq_trigger+0x88/0x120
[63168.100719]  proc_reg_write+0xac/0xf8
[63168.104340]  vfs_write+0xbc/0x398
[63168.107618]  ksys_write+0x68/0xf0
[63168.110895]  __arm64_sys_write+0x1c/0x28
[63168.114776]  invoke_syscall+0x44/0x108
[63168.118485]  el0_svc_common.constprop.3+0x94/0xf8
[63168.123143]  do_el0_svc+0x24/0x88
[63168.126420]  el0_svc+0x20/0x50
[63168.129439]  el0t_64_sync_handler+0x90/0xb8
[63168.133579]  el0t_64_sync+0x180/0x184
[63168.137206] SMP: stopping secondary CPUs
[63168.141091] Kernel Offset: disabled
[63168.144533] CPU features: 0x00001001,00000846
[63168.148845] Memory Limit: none
[63168.151871] ---[ end Kernel panic - not syncing: sysrq triggered crash ]---

AXG:BL1:d1dbf2:a4926f;FEAT:E0DC318C:2000;POC:F;EMMC:0;READ:0;0.0;CHK:0;
sdio debug board detected
TE: 33151

After a pause of 15 seconds (the watchdog-timeout value in the configuration file), the hardware timer restarts the controller and the boot log continues:

BL2 Built : 10:43:22, May 26 2021. axg g28b9431 - jenkins@walle02-sh
set vcck to 1100 mv
set vddee to 950 mv
Board ID = 9
CPU clk: 1200MHz
DDR low power enabled
DDR3 chl: Rank0 16bit @ 912MHz
bist_test rank: 0 1b 02 34 24 0b 3d 17 00 2f 27 0e 40 00 00 00 00 00 00 00 00 00 00 00 00 761   - PASS
Rank0: 1024MB(auto)-2T-13
AddrBus test pass!
eMMC boot @ 0
sw8 s
storage init finish
emmc switch 3 ok
Authentication key not yet programmed
get rpmb counter error 0x00000007
emmc switch 0 ok
Load FIP TMP HDR from eMMC, src: 0x0000c200, des: 0x05100000, size: 0x00004000, part: 0
0001c000Load BL31 from eMMC, src: 0x0001c200, des: 0x05104000, size: 0x0002ac00, part: 0
bl2z: ptr: 05127358, size: 00001e18
Load FIP HDR from eMMC, src: 0x0000c200, des: 0x01700000, size: 0x00004000, part: 0
Load BL3x from eMMC, src: 0x00010200, des: 0x01704000, size: 0x0008e400, part: 0
NOTICE:  BL31: v1.3(release):110e239
NOTICE:  BL31: Built : 19:07:23, Jul  2 2018
NOTICE:  BL31: AXG normal boot!
NOTICE:  BL31: BL33 decompress pass
[Image: axg_v1.1.3326-d0bacc8 2018-07-05 11:21:34 jenkins@walle02-sh]
OPS=0x43
25 0b 43 00 88 fc 1a 07 8d 24 0c 3a b2 65 16 59
bl30:axg ver: 9 mode: 0
bl30:axg thermal0
[0.015862 Inits done]
secure task start!
high task start!
low task start!
ERROR:   Error initializing runtime service opteed_fast


U-Boot 2022.01-armbian (Feb 08 2022 - 06:07:00 +0000) jethubj100