How to monitor an Nginx server - Q&A

A Practical Guide to Monitoring Nginx Servers

I. Monitoring Goals and Key Metrics

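Basic activity metrics: watch Active connections and the accepts/handled/requests counters on the status page. If accepts ≠ handled, connections were dropped, usually because a resource limit such as worker_connections was reached. The Reading/Writing/Waiting states help gauge concurrency pressure and connection reuse (with keep-alive enabled, a high Waiting count is normal).
Throughput and errors: continuously track QPS (requests per second) and the distribution of HTTP status codes (4xx/5xx). 5xx responses indicate server-side failures and should be investigated first.
Latency and upstreams: use the log variables $request_time (total time) and $upstream_response_time (time spent in the upstream) to plot TP50/TP90/TP99 and determine whether the bottleneck is Nginx itself or the upstream.
Saturation and capacity: watch utilization of CPU, memory, file descriptors, and connections. The maximum number of concurrent connections Nginx can hold is roughly worker_processes × worker_connections (this count includes connections to upstreams when proxying); set thresholds based on peak business traffic plus headroom.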
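The capacity estimate above can be turned into an alert threshold with a few lines of arithmetic. The following is a minimal sketch; the function names, the example values (4 workers, 1024 connections each), and the 25% headroom are illustrative assumptions, not values from any particular deployment.

```python
def max_connections(worker_processes: int, worker_connections: int) -> int:
    """Approximate upper bound on simultaneous connections across all workers."""
    return worker_processes * worker_connections


def alert_threshold(capacity: int, headroom: float = 0.25) -> int:
    """Connection count at which to alert, keeping `headroom` of capacity spare."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return int(capacity * (1 - headroom))


# Assumed example values: worker_processes 4; worker_connections 1024;
capacity = max_connections(worker_processes=4, worker_connections=1024)
print(capacity)                   # 4096
print(alert_threshold(capacity))  # 3072
```

When Nginx acts as a reverse proxy, the number of clients it can serve is lower than this bound, because upstream connections are drawn from the same pool.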

II. Data Collection and Visualization

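Lightweight approach: enable ngx_http_stub_status_module to expose a status page, scrape it with nginx-prometheus-exporter for Prometheus, and display the results in Grafana.
1. Check the module: run nginx -V; the output should include --with-http_stub_status_module.
2. Configure the status page: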
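```
server {
  listen 80;
  server_name localhost;
  location /nginx_status {
    stub_status;              # on nginx older than 1.7.5, write: stub_status on;
    access_log off;           # keep monitoring scrapes out of the access log
    allow 127.0.0.1;
    allow <monitoring-server-ip>;
    deny all;
  }
}
```
Then check the configuration and reload:
```
sudo nginx -t && sudo nginx -s reload
```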
Requesting /nginx_status on the server returns Active connections, accepts/handled/requests, and Reading/Writing/Waiting.
3. Deploy the exporter:
```
./nginx-prometheus-exporter --nginx.scrape-uri=http://localhost/nginx_status
# Metrics are served on :9113/metrics by default
```
4. Configure the Prometheus scrape job:
```
scrape_configs:
  - job_name: 'nginx'
    static_configs:
      - targets: ['<exporter-ip>:9113']
```
5. Import an Nginx dashboard into Grafana (for example, 4869) to complete the visualization.
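The status page from step 2 is plain text, so it can also be read directly, for instance in an ad hoc check script. Below is a minimal sketch of such a parser; the function name and the returned keys are assumptions for illustration, and the sample text follows the format shown in the stub_status module documentation.

```python
import re

SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict:
    """Extract the stub_status counters and derive the dropped-connection count."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The three cumulative counters sit alone on their own line.
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unrecognized stub_status output")
    accepts, handled, requests = map(int, counters.groups())
    reading, writing, waiting = map(int, states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
        "dropped": accepts - handled,  # non-zero means connections were dropped
    }


status = parse_stub_status(SAMPLE)
print(status["active"], status["dropped"])  # 291 0
```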
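Log-based approach: parse access.log/error.log and build metrics and dashboards in ELK (Elasticsearch + Logstash + Kibana) or Grafana + Loki. Use $request_time/$upstream_response_time to compute latency percentiles such as TP99, and alert on 5xx responses and critical errors.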
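To make the log-based idea concrete, here is a minimal sketch that computes latency percentiles and the 5xx ratio from access-log lines. It assumes a hypothetical log_format that appends `rt=$request_time urt=$upstream_response_time` to the combined format; the regular expression, function names, and sample lines are illustrative only. Percentiles use the simple nearest-rank method.

```python
import math
import re

# Assumed line layout: ... "REQUEST" STATUS BYTES "REFERER" "UA" rt=<sec> urt=<sec or ->
LINE_RE = re.compile(
    r'" (?P<status>\d{3}) (?:\d+|-) .* rt=(?P<rt>[\d.]+) urt=(?P<urt>\S+)'
)

SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "curl/8.0" rt=0.050 urt=0.040',
    '203.0.113.5 - - [10/Oct/2024:13:55:37 +0000] "GET /b HTTP/1.1" 200 512 "-" "curl/8.0" rt=0.120 urt=0.100',
    '203.0.113.5 - - [10/Oct/2024:13:55:38 +0000] "GET /c HTTP/1.1" 502 157 "-" "curl/8.0" rt=2.500 urt=2.490',
    '203.0.113.5 - - [10/Oct/2024:13:55:39 +0000] "GET /d HTTP/1.1" 200 512 "-" "curl/8.0" rt=0.300 urt=0.280',
    '203.0.113.5 - - [10/Oct/2024:13:55:40 +0000] "GET /e HTTP/1.1" 404 153 "-" "curl/8.0" rt=0.080 urt=-',
]


def parse_line(line):
    """Return (status, request_time, upstream_time or None), or None if unparsable."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    try:
        upstream = float(m.group("urt"))
    except ValueError:
        upstream = None  # "-" (no upstream) or a multi-upstream list; skipped here
    return int(m.group("status")), float(m.group("rt")), upstream


def percentile(values, p):
    """Nearest-rank percentile of a non-empty sequence, with p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(lines):
    rows = [r for r in map(parse_line, lines) if r is not None]
    if not rows:
        raise ValueError("no parsable log lines")
    request_times = [rt for _, rt, _ in rows]
    upstream_times = [u for _, _, u in rows if u is not None]
    return {
        "p50": percentile(request_times, 50),
        "p90": percentile(request_times, 90),
        "p99": percentile(request_times, 99),
        "upstream_p99": percentile(upstream_times, 99) if upstream_times else None,
        "error_ratio": sum(1 for s, _, _ in rows if 500 <= s <= 599) / len(rows),
    }


print(summarize(SAMPLE_LOG))
```

If request-time percentiles rise while upstream percentiles stay flat, the extra time is being spent in Nginx or on the client connection rather than in the upstream.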

III. Example Alert Rules

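High 5xx error rate over 5 minutes (the stub_status-based exporter exposes nginx_http_requests_total without a status label, so this rule assumes a metric source that adds one, such as a log-based exporter):

- alert: HighNginxErrorRate
  expr: sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum(rate(nginx_http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "High error rate on Nginx"
    description: "More than 5% of requests are returning 5xx errors."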

Too many active connections (1000 is an example; derive the threshold from your own capacity):

- alert: TooManyActiveConnections
  expr: nginx_connections_active > 1000
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Too many active connections"
    description: "Active connections exceed 1000!"

Dropped connections (accepts growing faster than handled):

- alert: NginxConnectionsDropped
  expr: increase(nginx_connections_accepted[5m]) > increase(nginx_connections_handled[5m])
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Nginx dropped connections detected"
    description: "Accepts grew faster than handled; possible worker_connections limit."
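
To show what the dropped-connections check computes, here is a minimal sketch that compares two samples of the cumulative accepts and handled counters. The function names and dictionary keys are assumptions; a counter that decreases is treated as a restart, roughly mirroring how Prometheus handles counter resets.

```python
def counter_increase(previous: int, current: int) -> int:
    """Increase of a cumulative counter between two samples.

    A decrease is treated as a reset (for example an Nginx restart),
    so the counter is assumed to have restarted from zero.
    """
    return current if current < previous else current - previous


def dropped_connections(prev: dict, curr: dict) -> int:
    """Connections accepted but not handled between two samples."""
    accepted = counter_increase(prev["accepts"], curr["accepts"])
    handled = counter_increase(prev["handled"], curr["handled"])
    return max(0, accepted - handled)


earlier = {"accepts": 1000, "handled": 1000}
later = {"accepts": 1500, "handled": 1480}
print(dropped_connections(earlier, later))  # 20
```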

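Abnormal latency (rising TP99). This assumes a latency histogram named nginx_http_request_duration_seconds, which the stub_status exporter does not provide; it has to come from a log-based or module-based exporter:

- alert: HighRequestLatency
  expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[5m])) by (le)) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High 99th percentile latency"
    description: "99th percentile request latency is above 1s."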

Route alerts through Alertmanager to channels such as email, DingTalk, or WeCom, and configure grouping, inhibition, and silences to cut noise.

IV. Advanced Tips and Troubleshooting

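Prefer semantic probes over port probes: issue an HTTP GET / from the local host and verify both the status code and the response body, so that a listening port with a broken application behind it does not go unnoticed.
Watch connection reuse: with keep-alive enabled, a high Waiting count is normal; if Reading + Writing stays high, the usual cause is request-processing concurrency or a slow upstream.
Capacity planning: combine worker_processes/worker_connections with peak traffic and keep 20% to 30% headroom; monitor saturation metrics such as CPU, memory, connection count, and disk I/O.
Load testing: after a release or a tuning change, verify with ab or wrk, for example: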
```
ab -n 1000 -c 100 http://<domain>/<path>
```
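Watch Requests per second and Time per request, and compare them with production metrics to assess the effect of the change.
Commercial edition: with NGINX Plus, the NGINX Plus API exposes richer health and performance metrics directly, which simplifies integration with an existing monitoring platform.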
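
The semantic probe recommended at the start of this section could be sketched as follows, using only the Python standard library. The function names, the expected text, and the timeout are assumptions for illustration; a real check would match whatever marker your own page is known to contain.

```python
import urllib.error
import urllib.request


def is_healthy(status: int, body: str,
               expected_status: int = 200, expected_text: str = "") -> bool:
    """Healthy only if the status code matches and the body contains the marker."""
    return status == expected_status and expected_text in body


def probe(url: str, expected_text: str = "", timeout: float = 3.0) -> bool:
    """GET the URL and apply the semantic check; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_healthy(resp.status, body, expected_text=expected_text)
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses (HTTPError).
        return False


# A listening port that serves an error page still fails the check:
print(is_healthy(200, "<h1>Welcome</h1>", expected_text="Welcome"))      # True
print(is_healthy(200, "<h1>Maintenance</h1>", expected_text="Welcome"))  # False
```

A call such as `probe("http://127.0.0.1/", expected_text="Welcome")` would then run the check against the local server.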
