Monitoring and Logging for GitLab on Linux - Q&A

Monitoring and Logging Practices for GitLab on Linux

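1. Monitoring Stack and Tools

System-level monitoring: Use top/htop, vmstat, iostat, sar, netstat/ss, and dstat to inspect CPU, memory, disk I/O, and network usage in real time. Pair them with Prometheus Node Exporter to collect host metrics for capacity planning and bottleneck analysis.
Application-level monitoring: GitLab exposes built-in Prometheus metrics endpoints that can be visualized in Grafana, covering request latency, error rate, Sidekiq queues, Puma/Workhorse, Gitaly, PostgreSQL, Redis, and other key components. Enterprise Edition adds audit events.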
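Self-monitoring and profiling: On versions that still include it, enable the Self Monitoring project to view the instance's own metrics (the feature was deprecated in GitLab 14.9 and removed in 16.0). Enable the Performance Bar in the Admin Area to see how long each stage of a single request takes (SQL, views, Gitaly, Redis, and so on).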
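Alerting: Define threshold rules in Prometheus for CPU/memory, HTTP 5xx, Sidekiq backlog, and disk space, and use Alertmanager to route the resulting alerts to notification channels such as email, WeCom, or Slack.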

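2. Logging Layout and Key Files

Log directory and components: On Omnibus installations, logs are collected under /var/log/gitlab/. Common components include gitlab-rails, nginx, puma, sidekiq, gitaly, postgresql, redis, registry, workhorse, prometheus, and grafana.
Structured logs: Rails writes production_json.log (one JSON object per line, easy to parse with Elasticsearch or Splunk). Fields include method, path, controller, action, status, duration_s, db_duration_s, gitaly_calls/gitaly_duration_s, redis_calls/redis_duration_s, correlation_id, user_id, and remote_ip. API requests are written to api_json.log, and exceptions carry exception.class, exception.message, and exception.backtrace.
Plain-text logs: production.log records requests and SQL queries, which is convenient for development and troubleshooting.
Log rotation: Logs captured by runit are written by svlogd to a file named current and rotated by svlogd (for example Alertmanager, Gitaly, PostgreSQL, Prometheus, and Redis). The logrotate service bundled with Omnibus rotates the remaining logs.
Audit and compliance: Enterprise Edition provides audit events, which can be combined with logs for compliance reviews.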
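
Because production_json.log holds one JSON object per line, it can be summarized with a few lines of scripting before a full log platform is in place. The following is a minimal sketch, not GitLab tooling: the function name `summarize_requests` and the 1-second slow threshold are illustrative assumptions, and it relies only on the controller, action, status, and duration_s fields listed above.

```python
import json
from collections import defaultdict


def summarize_requests(lines, slow_threshold_s=1.0):
    """Count requests, 5xx responses, and slow requests per controller#action.

    `lines` is any iterable of log lines (for example an open file).
    `slow_threshold_s` is an illustrative cutoff, not a GitLab default.
    """
    stats = defaultdict(lambda: {"count": 0, "errors_5xx": 0, "slow": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or non-JSON lines
        if not isinstance(entry, dict):
            continue
        key = f"{entry.get('controller', '?')}#{entry.get('action', '?')}"
        bucket = stats[key]
        bucket["count"] += 1
        status = entry.get("status")
        if isinstance(status, int) and status >= 500:
            bucket["errors_5xx"] += 1
        duration = entry.get("duration_s")
        if isinstance(duration, (int, float)) and duration >= slow_threshold_s:
            bucket["slow"] += 1
    return dict(stats)
```

To use it, open /var/log/gitlab/gitlab-rails/production_json.log and pass the file object to `summarize_requests`, then sort the result by whichever counter matters for the investigation.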

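3. Quick Start: Putting Monitoring in Place

Enable and verify metrics: Confirm that Prometheus is scraping the metrics endpoint of each GitLab component (such as /metrics), then add Prometheus as a data source in Grafana and import official or community dashboards.
Host metrics: Deploy Node Exporter and add a node job in Prometheus to monitor CPU, memory, disk, network, and filesystem usage.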
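
Before wiring an endpoint into Prometheus, it can help to check by hand that it really serves metrics. The sketch below fetches an endpoint and lists the metric names found in the Prometheus text exposition format. The helper names `parse_metric_names` and `fetch_metric_names` are hypothetical, and the parser is deliberately simplified (it ignores values, labels, and timestamps).

```python
import urllib.request


def parse_metric_names(text):
    """Return the set of metric names in Prometheus text exposition output."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def fetch_metric_names(url, timeout=5):
    """Fetch a metrics endpoint over HTTP and return the metric names it exposes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metric_names(resp.read().decode("utf-8"))
```

For example, `fetch_metric_names("http://localhost:9100/metrics")` targets Node Exporter on its default port 9100. The host and port are assumptions about the local setup, and GitLab's own endpoints may restrict access by IP address.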
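Example alert rules (Prometheus). Each expression below belongs in the expr field of a rule with for: 5m. Prometheus 2.x defines rules in YAML rule files and no longer accepts the older ALERT ... IF ... FOR syntax:
- Host CPU above 80% for 5 minutes
  - Rule HighCPUUsage: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
- Rising GitLab HTTP 5xx rate
  - Rule GitLabHigh5xx: sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum(rate(nginx_http_requests_total[5m])) > 0.01 (the metric and label names depend on the NGINX exporter in use)
Notifications: Configure Alertmanager routes and receivers (email/WeCom/Slack), and tier alerts by severity (P1/P2).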
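
To make the arithmetic behind threshold rules like these concrete, the sketch below computes a CPU utilization (one minus the idle fraction) and a 5xx error ratio from two counter samples taken one window apart. It is a simplified illustration, not how Prometheus evaluates rate(): extrapolation and counter-reset handling are omitted, and all function names are hypothetical.

```python
def counter_rate(first, last, window_s):
    """Per-second increase of a counter between two samples.

    A negative difference (counter reset) is clamped to zero, which is
    cruder than what Prometheus does.
    """
    return max(last - first, 0.0) / window_s


def cpu_utilization(idle_first, idle_last, window_s, cpus):
    """Fraction of CPU time spent non-idle, given idle seconds summed over all CPUs."""
    idle_fraction = counter_rate(idle_first, idle_last, window_s) / cpus
    return 1.0 - idle_fraction


def error_ratio(err_first, err_last, total_first, total_last, window_s):
    """Share of requests that returned 5xx within the window."""
    total = counter_rate(total_first, total_last, window_s)
    if total == 0:
        return 0.0  # no traffic, so no error ratio to alert on
    return counter_rate(err_first, err_last, window_s) / total
```

With 4 CPUs and 240 idle CPU-seconds accumulated over a 300-second window, `cpu_utilization(0, 240, 300, 4)` evaluates to 0.8, which is exactly the 80% threshold. With 30 errors out of 1500 requests in the same window, `error_ratio(0, 30, 0, 1500, 300)` gives 0.02, above the 1% threshold.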

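4. Quick Start: Putting Logging in Place

Live viewing and searching:
- Live component logs: gitlab-ctl tail (for example: gitlab-ctl tail nginx; gitlab-ctl tail gitlab-rails)
- Follow a file: tail -f /var/log/gitlab/gitlab-rails/production.log
- Keyword search: grep -i "error" /var/log/gitlab/gitlab-rails/production.log
- Structured analysis: ship production_json.log to ELK/Graylog/Splunk, then aggregate and drill down by controller/action, status, duration_s, and correlation_id.
Adjusting log levels:
- Global: set the environment variable GITLAB_LOG_LEVEL=info (a level name, or a number from 0 to 5). GitLab loggers default to DEBUG.
- Per component: some services have their own setting, such as SIDEKIQ_LOG_LEVEL for Sidekiq and GRPC_LOG_LEVEL for Gitaly gRPC.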
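Rotation and retention:
- Adjust the logrotate policy in /etc/gitlab/gitlab.rb (for example logging['logrotate_frequency'] and logging['logrotate_rotate'] for frequency and number of kept files), then run gitlab-ctl reconfigure to apply it. To force a rotation, run the embedded logrotate binary against the generated configuration (on Omnibus installs typically /opt/gitlab/embedded/sbin/logrotate -fv -s /var/opt/gitlab/logrotate/logrotate.status /var/opt/gitlab/logrotate/logrotate.conf).
- For services managed by svlogd (such as PostgreSQL, Redis, and Prometheus), rotation is controlled by the logging['svlogd_*'] settings in gitlab.rb (size, number of files, timeout), which are rendered into a config file in each service's log directory.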
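
One way to check that rotation is keeping log volume bounded is to measure how much space each component directory uses. The following is a minimal sketch assuming the /var/log/gitlab layout described earlier. `dir_sizes` and `largest` are hypothetical helper names, and the script needs read access to the log tree.

```python
import os


def dir_sizes(root):
    """Return {subdirectory name: total bytes of all files beneath it}."""
    sizes = {}
    with os.scandir(root) as it:
        for entry in it:
            if not entry.is_dir(follow_symlinks=False):
                continue  # only component directories are of interest
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for fname in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, fname)).st_size
                    except OSError:
                        pass  # file was rotated away between listing and stat
            sizes[entry.name] = total
    return sizes


def largest(sizes, n=5):
    """Return the n largest (name, bytes) pairs, biggest first."""
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Calling `largest(dir_sizes("/var/log/gitlab"))` lists the five heaviest components, a quick signal for where a rotation policy needs tightening.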

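5. Common Problems and Troubleshooting Paths

502/504 errors or a slow home page:
- Check status, duration_s, db_duration_s, gitaly_calls/gitaly_duration_s, and redis_calls/redis_duration_s in gitlab-rails/production_json.log, and use correlation_id to follow a request across NGINX, Workhorse, Puma, Gitaly, and the database.
- Check the puma, workhorse, and nginx error logs and the runit service status (gitlab-ctl status). If the supervisor itself is suspect, inspect its systemd unit with journalctl -u gitlab-runsvdir.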
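Sidekiq backlog and queued jobs:
- Check sidekiq/current and any Sidekiq-related entries in gitlab-rails/production_json.log, and watch queue and duration metrics in Prometheus, such as sidekiq_queue_size and sidekiq_jobs_completion_seconds (exact metric names can vary by GitLab version).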
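Slow repository clone or fetch:
- Check gitlab-shell.log and the Gitaly metrics and logs, paying attention to gitaly_calls/gitaly_duration_s as well as network and disk performance.
Disk alerts:
- Check how much space each directory under /var/log/gitlab uses and verify that logrotate is actually running; clean up old archives and unused containers/images.
Security and compliance:
- Audit key operations (Enterprise Edition), and apply access control and retention policies to /var/log/gitlab and to the central log platform.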
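
Following one request across components by correlation_id, as suggested for 502/504 errors, can be sketched as a small script that scans several JSON log files for the same ID. This is an illustration under assumptions: it presumes each file holds one JSON object per line with a top-level correlation_id field, and the function names and any paths passed in are hypothetical.

```python
import json


def lines_for_correlation_id(lines, correlation_id):
    """Yield parsed JSON entries whose correlation_id equals the given ID."""
    for line in lines:
        if correlation_id not in line:
            continue  # cheap substring filter before paying for JSON parsing
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text or truncated line
        if isinstance(entry, dict) and entry.get("correlation_id") == correlation_id:
            yield entry


def trace(paths, correlation_id):
    """Collect (path, entry) pairs for one correlation ID across log files."""
    hits = []
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as fh:
                for entry in lines_for_correlation_id(fh, correlation_id):
                    hits.append((path, entry))
        except OSError:
            continue  # unreadable or missing file; move on to the next one
    return hits
```

A typical call passes the production_json.log path plus the current files of the components under suspicion, then sorts the hits by whatever timestamp field those logs carry.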
