Using systemd to Control Linux Service Startup Order and Restart on Failure

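Background

Our team built a LoRa gateway on Armbian. On power-up it must start the main program, packet_forwarder, which relays traffic between LoRa and UDP to communicate with the server.
This should have been a simple requirement: wrap the program in a service and load it into systemd. The rime_gateway.service unit file is as follows: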

[Unit]
Description=Rime LoRaWAN Gateway

[Service]
WorkingDirectory=/home/rime/packet_forwarder/lora_pkt_fwd
ExecStart=/home/rime/packet_forwarder/lora_pkt_fwd/start_gateway.sh
Restart=always

[Install]
WantedBy=multi-user.target

For an explanation of the syntax, see "Systemd 入门教程:命令篇" (Systemd Tutorial for Beginners: Commands).

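An Unstable Service

When started manually with systemctl start rime_gateway.service, the service works well.

However, after Armbian powers up and the service starts automatically, systemctl status rime_gateway.service shows that it has stopped working: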

rime_gateway.service - Rime LoRaWAN Gateway
   Loaded: loaded (/lib/systemd/system/rime_gateway.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2020-04-20 06:51:46 UTC; 29s ago
  Process: 1112 ExecStart=/home/rime/packet_forwarder/lora_pkt_fwd/start_gateway.sh (code=exited, status=1/FAILURE)
 Main PID: 1112 (code=exited, status=1/FAILURE)

Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=100ms expired, scheduling restart.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 5.
Apr 20 06:51:46 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Start request repeated too quickly.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 06:51:46 orangepizero systemd[1]: Failed to start Rime LoRaWAN Gateway.

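The output above shows that the service was restarted too quickly, so systemd gave up restarting it and marked it as failed.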
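The "Start request repeated too quickly" message comes from systemd's start rate limiting. With the stock defaults (StartLimitBurst=5 within StartLimitIntervalSec=10s, unless changed in /etc/systemd/system.conf), a unit is allowed at most five starts per ten-second window, and the next start request is refused. A minimal sketch for checking the values systemd has actually loaded for a unit follows. The helper name show_restart_limits is made up for illustration, and the property names may differ on older systemd versions.

```shell
#!/bin/sh
# Hypothetical helper: print the restart-related properties that systemd
# has loaded for the unit given as $1.
show_restart_limits() {
    systemctl show "$1" \
        -p Restart \
        -p RestartUSec \
        -p StartLimitBurst \
        -p StartLimitIntervalUSec
}

# Usage (on the gateway itself):
#   show_restart_limits rime_gateway.service
```

Checking the loaded values rather than the unit file is useful because drop-ins and manager-wide defaults can both change what applies.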

journalctl -u rime_gateway.service shows the details: systemd restarted the service 5 times with a 100 ms delay (the default RestartSec), and every attempt failed.

-- Logs begin at Mon 2020-04-20 06:51:31 UTC, end at Mon 2020-04-20 06:55:01 UTC. --
Apr 20 06:51:40 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
Apr 20 06:51:40 orangepizero start_gateway.sh[572]: Reset start_gateway.sh
Apr 20 06:51:41 orangepizero start_gateway.sh[572]: Starting start_gateway.sh
Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=100ms expired, scheduling restart.
Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 1.

...

Apr 20 06:51:45 orangepizero start_gateway.sh[1112]: Reset start_gateway.sh
Apr 20 06:51:46 orangepizero start_gateway.sh[1112]: Starting start_gateway.sh
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=100ms expired, scheduling restart.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 5.
Apr 20 06:51:46 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Start request repeated too quickly.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 06:51:46 orangepizero systemd[1]: Failed to start Rime LoRaWAN Gateway.

The gateway's own log (tail -f /tmp/start_gateway.sh.log) shows the cause of the failure: the network was not yet up.

ERROR: [up] connect returned Network is unreachable

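Changing the Startup Order

Clearly the service depends on the network being up, so we first added the following line to the [Unit] section: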

After=network.target

Did the ordering take effect? To find out, we exported the boot sequence and inspected it:

systemd-analyze plot > boot.svg

Opening boot.svg in Chrome shows that network.target starts first and rime_gateway.service starts after it.
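On a headless gateway it can be more convenient to check the ordering as text instead of copying an SVG file to another machine. The sketch below wraps two standard commands, systemd-analyze critical-chain and systemctl list-dependencies --after. The helper name show_start_order is an assumption made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: text-only views of the startup ordering of unit $1.
show_start_order() {
    # Chain of units that the given unit waited for, with timing information.
    systemd-analyze critical-chain "$1"
    # Units that the given unit is ordered after.
    systemctl list-dependencies --after "$1"
}

# Usage (on the gateway itself):
#   show_start_order rime_gateway.service
```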


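For more on startup order, see "Linux systemd启动守护进程,service启动顺序分析及调整service启动顺序" (an analysis of systemd daemon startup and how to adjust service startup order).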
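One caveat: ordering after network.target only means the network management stack has been started. It does not by itself guarantee that an address or route is configured, which is consistent with the "Network is unreachable" error. systemd documents network-online.target for units that need a working connection. The sketch below writes a drop-in that uses it. The drop-in directory and the file name 10-network-online.conf are assumptions chosen for illustration, and whether the target really waits depends on a wait-online service (for example systemd-networkd-wait-online.service or NetworkManager-wait-online.service) being enabled on the system.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that orders a unit after
# network-online.target. $1 is the drop-in directory, for example
# /etc/systemd/system/rime_gateway.service.d (an assumed location).
write_network_online_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/10-network-online.conf" <<'EOF'
[Unit]
# Wants= pulls the target into the boot transaction; After= orders against it.
Wants=network-online.target
After=network-online.target
EOF
}

# Usage (as root), followed by a reload so systemd picks up the change:
#   write_network_online_dropin /etc/systemd/system/rime_gateway.service.d
#   systemctl daemon-reload
```

Even with this ordering, the network can still drop later, so automatic restart on failure remains worthwhile.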

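Restarting on Failure

To make the service more robust, it should restart automatically whenever it exits with a failure. We added the following settings.

Setting StartLimitIntervalSec=0 in the [Unit] section disables start rate limiting, so systemd keeps trying to restart the service indefinitely: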

StartLimitIntervalSec=0

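Restarting once per second, set in the [Service] section, is a good idea because it avoids putting too much load on the machine when something goes wrong: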

RestartSec=1

For more on automatic restarts, see "使用systemd创建Linux服务" (Creating a Linux service with systemd).

A Stable Service

The final rime_gateway.service unit file is as follows:

[Unit]
Description=Rime LoRaWAN Gateway
After=network.target
StartLimitIntervalSec=0

[Service]
WorkingDirectory=/home/rime/packet_forwarder/lora_pkt_fwd
ExecStart=/home/rime/packet_forwarder/lora_pkt_fwd/start_gateway.sh
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target

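systemctl status rime_gateway.service and journalctl -u rime_gateway.service now show that the service starts normally.

To test the failure case, we unplugged the network cable and rebooted Armbian. systemd restarted the service at 1-second intervals until the network recovered (78 restarts in this run).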

-- Logs begin at Mon 2020-04-20 07:32:09 UTC, end at Mon 2020-04-20 07:35:12 UTC. --
Apr 20 07:32:19 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
Apr 20 07:32:20 orangepizero start_gateway.sh[839]: Reset start_gateway.sh
Apr 20 07:32:20 orangepizero start_gateway.sh[839]: Starting start_gateway.sh
Apr 20 07:32:20 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 07:32:20 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 07:32:21 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=1s expired, scheduling restart.
Apr 20 07:32:21 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 1.
Apr 20 07:32:21 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
Apr 20 07:32:21 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
Apr 20 07:32:22 orangepizero start_gateway.sh[991]: Reset start_gateway.sh
Apr 20 07:32:22 orangepizero start_gateway.sh[991]: Starting start_gateway.sh

...

Apr 20 07:34:54 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 07:34:54 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 07:34:55 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=1s expired, scheduling restart.
Apr 20 07:34:55 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 78.
Apr 20 07:34:55 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
Apr 20 07:34:55 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
Apr 20 07:34:55 orangepizero start_gateway.sh[2644]: Reset start_gateway.sh
Apr 20 07:34:56 orangepizero start_gateway.sh[2644]: Starting start_gateway.sh
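Reading the restart count out of a long journal by eye is tedious. The sketch below extracts the highest restart counter from journal text on standard input. The name max_restart_counter is invented here, and the pattern assumes the "restart counter is at N." wording shown in the log above.

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and print the highest
# "restart counter is at N" value seen (prints nothing if there is none).
max_restart_counter() {
    sed -n 's/.*restart counter is at \([0-9][0-9]*\)\..*/\1/p' | sort -n | tail -n 1
}

# Usage (on the gateway itself):
#   journalctl -u rime_gateway.service | max_restart_counter
```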


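# Systemd: Controlling the Startup Order of Processes

# Problem Description

After rebooting the Linux system, the WeChat official account could no longer serve requests. Logging in to the server showed that the mysql service had started normally. The supervisor log, however, showed an error while starting the uwsgi process, reported as a database connection failure. Since both mysql and supervisor are started at boot through systemctl, supervisor most likely started before mysql, so the connection failed.

# Solution

systemd controls the startup order of units through the Before= and After= options in the [Unit] section of a unit file. These options only affect ordering; they do not by themselves cause the other unit to be started.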

vim /lib/systemd/system/supervisor.service

[Unit]
After=mariadb.service

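Several units can be listed on one line, separated by spaces:

After=syslog.target network.target remote-fs.target nss-lookup.target

After editing the unit file, reload systemd, then enable and restart the service:

systemctl daemon-reload
systemctl enable yourservice
systemctl restart yourservice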
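To confirm that the new ordering has been loaded, the unit's After property can be queried with systemctl show. The sketch below wraps that check. The helper name is_ordered_after is an assumption made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if unit $1 is ordered after unit $2,
# according to the After property that systemd has loaded.
is_ordered_after() {
    # Output looks like "After=a.service b.target ..."; split it into one
    # name per line and look for an exact, literal match.
    systemctl show "$1" -p After | tr ' =' '\n\n' | grep -Fqx -- "$2"
}

# Usage:
#   is_ordered_after supervisor.service mariadb.service && echo "ordering is in place"
```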
