How to monitor and alert on triggers in Ubuntu - Q&A

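A guide to monitoring triggers and alerting on Ubuntu
On Ubuntu, a "trigger" usually means a detection condition for an abnormal system state (a service going down, a resource exceeding a threshold, a file changing, and so on), while an "alert" is the process of delivering that information to an administrator through a chosen channel (email, SMS, desktop notification, etc.). The common approaches below cover command-line tools, graphical tools, and enterprise-grade systems: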

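I. Basic command-line tools: quick, simple monitoring and alerting

Command-line tools are well suited to quick status checks, and combining them with scripts and cron gives basic alerting. The examples in this section assume a working mail command (for example from the mailutils package) and use admin@example.com as a placeholder address: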

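Watching logs in real time (tail -f)
Follows a system or application log as it grows (for example the Nginx error log) and raises an alert when a keyword such as "error" or "failed" appears.
Example: tail -f /var/log/nginx/error.log | grep --line-buffered "error" | while IFS= read -r line; do echo "$(date): $line" | mail -s "Nginx Error Alert" admin@example.com; done
Note: the pipeline reads the log as it is written and sends one email for every line that matches "error"; --line-buffered stops grep from holding matches back in its output buffer. With tail -f the pipeline stops following after the log is rotated; tail -F reopens the file by name.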
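The same follow-and-match logic can be sketched in Python, which may be easier to extend later (several keywords, case-insensitive matching). This is a minimal sketch rather than a replacement for the pipeline: the names follow and matching_lines and the default pattern are illustrative assumptions, and alert delivery is left out.

```python
import re
import time
from typing import Iterable, Iterator


def follow(path: str, poll_seconds: float = 1.0) -> Iterator[str]:
    """Yield lines appended to `path`, similar to `tail -f` (never returns)."""
    with open(path, "r", errors="replace") as f:
        f.seek(0, 2)  # start at the end of the file, like tail -f
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(poll_seconds)  # nothing new yet; wait and retry


def matching_lines(lines: Iterable[str],
                   pattern: str = r"error|failed") -> Iterator[str]:
    """Yield only the lines that match `pattern`, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    for line in lines:
        if rx.search(line):
            yield line.rstrip("\n")


# Hypothetical usage (blocks forever, so it is left as a comment):
# for hit in matching_lines(follow("/var/log/nginx/error.log")):
#     print(hit)
```

Keeping the matching step separate from the file-following step means the filter can be tried on any list of strings before it is pointed at a live log.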
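Watching file system changes (inotifywait)
Monitors create, modify, and delete events on a directory or file (for example configuration changes under /etc) and raises an alert when one occurs. inotifywait is provided by the inotify-tools package (sudo apt install inotify-tools).
Example: inotifywait -m -e modify,create,delete /etc | while read -r path action file; do echo "$(date): File $file in $path was $action" | mail -s "File Change Alert" admin@example.com; done
Note: -m keeps monitoring instead of exiting after the first event, and -e selects the event types. Without -r only /etc itself is watched, not its subdirectories.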
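When the event lines are consumed by a script rather than a shell loop, they need to be split into their three parts. The sketch below assumes inotifywait's default output format, "<directory> <EVENTS> <filename>", and a watched directory whose path contains no spaces; FileEvent and parse_event are illustrative names, not part of inotify-tools.

```python
from typing import NamedTuple, Set


class FileEvent(NamedTuple):
    directory: str
    events: Set[str]
    filename: str


def parse_event(line: str) -> FileEvent:
    """Parse one default-format inotifywait line: '<dir> <EVENTS> <file>'."""
    # Split at most twice so that file names containing spaces stay intact.
    parts = line.rstrip("\n").split(" ", 2)
    directory, events = parts[0], parts[1]
    filename = parts[2] if len(parts) > 2 else ""
    # Several events can arrive together, joined by commas (e.g. CREATE,ISDIR).
    return FileEvent(directory, set(events.split(",")), filename)
```

Representing the events as a set makes checks such as "DELETE" in event.events straightforward when one line carries several event names.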
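Checking trigger conditions periodically (cron + script)
cron runs a script on a schedule; the script checks a system metric (CPU usage, service status, and so on) and sends an alert when it exceeds a threshold.
Example script (/usr/local/bin/check_trigger.sh):
```
#!/bin/bash
# Alert when overall CPU usage exceeds this percentage.
CPU_THRESHOLD=80
# Extract the idle percentage from top's "Cpu(s)" line and convert it to usage.
# LC_ALL=C forces "." as the decimal separator so the sed pattern matches.
CPU_USAGE=$(LC_ALL=C top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
# bash cannot compare floats, so bc evaluates the comparison (prints 1 or 0).
if (( $(echo "$CPU_USAGE > $CPU_THRESHOLD" | bc -l) )); then
    echo "$(date): CPU usage is ${CPU_USAGE}%, exceeding threshold ${CPU_THRESHOLD}%" | mail -s "CPU High Alert" admin@example.com
fi
```
Add the cron job (runs every 5 minutes): make the script executable with chmod +x /usr/local/bin/check_trigger.sh, then run crontab -e and add */5 * * * * /usr/local/bin/check_trigger.sh
Note: top supplies the CPU usage and bc decides whether it exceeds the threshold.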
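Parsing top's output depends on its display format. An alternative that may be more robust is to compute usage from two samples of the kernel's counters in /proc/stat. The sketch below is one way to do that under stated assumptions: the function names are illustrative, idle time is taken as idle + iowait, and the caller decides how far apart the two samples are.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Turn the aggregate 'cpu ...' line of /proc/stat into a list of counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def cpu_usage_percent(prev: List[int], cur: List[int]) -> float:
    """CPU usage between two samples, as a percentage of elapsed CPU time."""
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted in user/nice, so only the first eight are summed.
    prev_total, cur_total = sum(prev[:8]), sum(cur[:8])
    prev_idle, cur_idle = prev[3] + prev[4], cur[3] + cur[4]  # idle + iowait
    total_delta = cur_total - prev_total
    if total_delta <= 0:
        return 0.0  # no time elapsed between the samples
    return 100.0 * (total_delta - (cur_idle - prev_idle)) / total_delta


def read_cpu_counters(path: str = "/proc/stat") -> List[int]:
    """Read the current aggregate counters (the first line of /proc/stat)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

A checker would call read_cpu_counters() twice, a second or more apart, pass both results to cpu_usage_percent(), and compare the result with the threshold.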

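II. Graphical tools: monitoring and alerts at a glance

Graphical and dashboard-style tools suit day-to-day operations: there are no commands to memorize and the display updates live: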

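Glances (cross-platform real-time monitoring)
Monitors CPU, memory, disk, network, sensors (temperature), and more, with threshold highlighting (for example CPU above 80% turns red). It runs in the terminal and can also be viewed remotely through its web interface or its client mode.
Installation:
```
sudo apt install glances  # available from the Ubuntu repositories on 16.04 and later
sudo pip install glances  # older releases may need a pip install instead
```
Start: run glances in a terminal; press c to sort by CPU, m to sort by memory, and q to quit;
Remote monitoring: start glances -s on the server, then run glances -c <server IP> on the client.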
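Built-in monitoring tool (System Monitor)
Ubuntu's own graphical tool shows CPU, memory, disk, and network usage in real time and lets you inspect individual processes, which makes it handy for quickly tracking down a resource bottleneck.
Launch: open the application menu and choose "System Monitor" (or run gnome-system-monitor).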
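Conky (highly customizable desktop monitor)
Displays system information (CPU, memory, disk space, network traffic) directly on the desktop and can run custom scripts (for example to show trigger status), which suits individual needs.
Install: sudo apt install conky;
Configure: edit ~/.conkyrc and add lines like the following to the text section to monitor CPU:
```
${color white}CPU Usage:${color} $cpu% (${cpubar})
${color red}${execi 5 if [ $(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk "{print int(100 - \$1)}") -gt 80 ]; then echo "High CPU Alert: CPU > 80%"; fi}${color}
```
Note: ${execi 5 ...} runs the command every 5 seconds and prints its output, and ${cpubar} draws a CPU usage bar. The usage value is truncated to an integer with int() because the shell's -gt test only compares integers.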

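III. Enterprise monitoring systems: full trigger and alerting solutions

Enterprise tools support distributed monitoring, custom rules, and multi-channel alerting, and suit large server fleets: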

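Prometheus + Alertmanager (time-series monitoring plus alerting)
- Prometheus: collects system and application metrics through exporters (node_exporter for hosts, nginx_exporter for Nginx, and so on) and stores them as time series;
- Alertmanager: handles the alerts that Prometheus fires and routes them to email, Slack, PagerDuty, and other channels.
  Installation steps:
- Install Prometheus: sudo apt install prometheus, then edit /etc/prometheus/prometheus.yml to add scrape targets (for example node_exporter at localhost:9100);
- Install Alertmanager: sudo apt install prometheus-alertmanager (the package name in the Ubuntu repositories), then edit its configuration file (/etc/prometheus/alertmanager.yml with that package) to set up email notification (SMTP settings);
- Create the alerting rules (/etc/prometheus/rules.yml) and list that file under rule_files in prometheus.yml so that Prometheus loads it: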
```
groups:
- name: node_rules
  rules:
  - alert: HighCPUUsage
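    # rate() averages over the whole 5m window; irate() uses only the last two samples and is noisier for alerting
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80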
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% for 5 minutes."
```
- Restart the services: sudo systemctl restart prometheus prometheus-alertmanager (the unit names used by the Ubuntu packages).
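The for: 5m clause in the rule is easy to misread: the alert does not fire the first time the expression is true, only after it has stayed true for the whole duration. The sketch below is a simplified model of that behavior, not Prometheus code; it assumes one sample per rule evaluation, and alert_states is an illustrative name.

```python
from typing import List, Tuple


def alert_states(samples: List[Tuple[int, float]],
                 threshold: float = 80.0,
                 for_seconds: int = 300) -> List[str]:
    """Return 'inactive', 'pending', or 'firing' for each (timestamp, value)."""
    states = []
    active_since = None  # time at which the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held_long_enough = ts - active_since >= for_seconds
            states.append("firing" if held_long_enough else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states
```

With evaluations every 60 seconds, a CPU value that stays above 80 is reported as pending for the first five evaluations and as firing from the sixth; a single dip below the threshold resets the timer.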
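Zabbix (enterprise all-in-one monitoring)
Monitors servers, network devices, and applications, with triggers configured through a web interface (for example "Nginx service stopped" or "less than 10% disk space remaining") and support for automatic remediation such as restarting a service.
Installation steps:
- Install the Zabbix server, frontend, and agent: sudo apt install zabbix-server-mysql zabbix-frontend-php zabbix-apache-conf zabbix-agent;
- Configure the database (MySQL): create the Zabbix database and import the initial schema;
- Open the web interface (http://<server IP>/zabbix) and complete the initial setup (add hosts, configure items, and so on);
- Define triggers: under "Configuration → Hosts → Triggers", add a rule such as last(/<host>/proc.num[nginx])=0, which fires when no nginx process is running (this is the Zabbix 5.4+ expression syntax; earlier versions write it as {<host>:proc.num[nginx].last()}=0).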
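The "service stopped, restart it" pattern that Zabbix automates can be illustrated outside Zabbix with a short sketch. It assumes a systemd-managed service and relies on systemctl is-active --quiet, which exits with status 0 only when the unit is active; the function names and the injectable run parameter (there so the logic can be tried without touching real services) are illustrative choices.

```python
import subprocess
from typing import Callable


def is_active(service: str, run: Callable = subprocess.run) -> bool:
    """True when `systemctl is-active --quiet <service>` exits with 0."""
    result = run(["systemctl", "is-active", "--quiet", service])
    return result.returncode == 0


def ensure_running(service: str, run: Callable = subprocess.run) -> str:
    """Return 'ok', 'recovered' (restart helped), or 'failed' (still down)."""
    if is_active(service, run):
        return "ok"
    # The trigger condition is met: attempt the remediation once.
    run(["systemctl", "restart", service])
    return "recovered" if is_active(service, run) else "failed"
```

A real check would send an alert for both "recovered" and "failed", since an automatic restart that succeeds still points to a problem worth investigating.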
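Nagios (classic monitoring system)
Suits traditional operations environments: it monitors service availability (HTTP, FTP, SSH, and so on) and is extended through plugins (check_disk for disk space, check_load for system load).
Installation steps:
- Install Nagios: sudo apt install nagios3 nagios-plugins (package names from older Ubuntu releases; newer releases ship nagios4 and monitoring-plugins, with configuration under /etc/nagios4 instead);
- Configure checks: edit /etc/nagios3/conf.d/localhost_nagios2.cfg and add a service check (for example an HTTP check):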
```
define service {
    use                 generic-service
    host_name           localhost
    service_description HTTP
    check_command       check_http
}
```
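- Set up alerting: edit /etc/nagios3/conf.d/contacts_nagios2.cfg, add the contact's email address, and set the notification commands (for example notify-service-by-email and notify-host-by-email).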
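Nagios plugins such as check_disk follow a simple convention: they print one status line and exit with 0 (OK), 1 (WARNING), or 2 (CRITICAL). The sketch below shows that convention for a disk-space check. It is not the real check_disk, and the 20% and 10% thresholds and the function names are assumptions chosen for illustration.

```python
import shutil
from typing import Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # exit codes Nagios maps to service states


def evaluate_free_space(total: int, free: int,
                        warn_pct: float = 20.0,
                        crit_pct: float = 10.0) -> Tuple[int, str]:
    """Map free disk space to a Nagios-style (exit code, status line) pair."""
    free_pct = 100.0 * free / total
    if free_pct < crit_pct:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free"
    if free_pct < warn_pct:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free"
    return OK, f"DISK OK - {free_pct:.1f}% free"


def check_path(path: str = "/") -> Tuple[int, str]:
    """Evaluate the file system that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return evaluate_free_space(usage.total, usage.free)
```

To use it as a plugin, a small wrapper would print the status line and pass the code to sys.exit(); Nagios then raises notifications according to that exit status.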

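IV. Custom scripts: flexibility for specific needs

For special triggers (such as "a file contains a particular keyword" or "a database connection fails"), a Shell or Python script run on a schedule by cron or a systemd timer can implement the check and send the alert.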
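Example (Python script that watches a file for a keyword):

```
#!/usr/bin/env python3
"""Poll a log file and email an alert when newly added content has a keyword."""
import smtplib
import time
from email.mime.text import MIMEText

LOG_PATH = '/path/to/file.log'
KEYWORD = b'ERROR'          # compared against raw bytes read from the file
CHECK_INTERVAL = 60         # seconds between checks


def send_alert(message):
    """Send the alert through an SMTP server using STARTTLS."""
    sender = 'alert@example.com'
    receiver = 'admin@example.com'
    msg = MIMEText(message)
    msg['Subject'] = 'Trigger Alert'
    msg['From'] = sender
    msg['To'] = receiver
    with smtplib.SMTP('smtp.example.com', 587) as server:
        server.starttls()
        server.login('alert', 'password')  # placeholder credentials
        server.sendmail(sender, receiver, msg.as_string())


def check_file(offset):
    """Scan content added since `offset`; return the new end-of-file offset."""
    with open(LOG_PATH, 'rb') as f:
        f.seek(0, 2)             # jump to the end to learn the file size
        if f.tell() < offset:    # file was truncated or rotated: start over
            offset = 0
        f.seek(offset)
        content = f.read()
        new_offset = f.tell()
    if KEYWORD in content:
        send_alert('Error found in file!')
    return new_offset


offset = 0
while True:
    offset = check_file(offset)
    time.sleep(CHECK_INTERVAL)  # check once a minute
```

Note: the script checks the file once a minute and emails an alert when content added since the previous check contains "ERROR". Tracking the read offset keeps it from re-sending the same alert for old lines on every pass.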
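The other trigger mentioned above, "a database connection fails", can be approximated with a plain TCP reachability check. This is a minimal sketch under a clear limitation: a successful TCP connection only shows that something is listening on the port, not that the database accepts queries. The function name can_connect and the host and port in the usage comment are illustrative.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage inside a cron-driven check:
# if not can_connect("127.0.0.1", 3306):
#     send_alert("Database port 3306 is unreachable")
```

For a stricter check, the script would log in with the database's own client library and run a trivial query, at the cost of an extra dependency.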

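The methods above cover monitoring and alerting needs from basic to advanced; choose according to system scale, technology stack, and budget. For example:

Small servers: Glances + cron scripts;
Mid-sized organizations: Prometheus + Alertmanager;
Large, complex environments: Zabbix or Nagios.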
