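HBase Web UI: open http://<master-node-IP>:16010/master-status to view cluster status, the RegionServer list, Region distribution, table information, and core metrics such as read/write request counts and latency.
HBase shell: run status 'detailed' to see detailed cluster status. For table-level information, commands such as describe '<table>' and list_regions '<table>' are the usual tools; table_help only prints usage help for table references, and region_count does not appear among the standard shell commands, so check help in your version.
JMX: enable remote JMX in hbase-env.sh (for example HBASE_JMX_BASE="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=10101 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"), then connect with jconsole or VisualVM to watch JVM memory, threads, GC, and HBase internal metrics (such as the BlockCache hit ratio and Compaction queue length) in real time. Disabling authentication and SSL is only appropriate on a trusted network, and when the Master and a RegionServer run on the same host each process needs its own JMX port.
Prometheus + Grafana (recommended):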
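Metrics collection: use hbase-exporter to collect HBase metrics (such as hbase_regionserver_storefile_index_size and hbase_regionserver_compaction_queue_length). Apache HBase does not ship a component under this name, so treat it as a community exporter and confirm which metric names it actually exposes.
Visualization: import an HBase dashboard into Grafana (ID 4791) to display cluster status, RegionServer load, storage usage, and other metrics.
Exporter configuration: for hbase-exporter, set the hbase.master and hbase.regionserver addresses in hbase-exporter.yml.
Prometheus configuration: edit prometheus.yml and add a job for the exporter under scrape_configs:
    scrape_configs:
      - job_name: 'hbase'
        static_configs:
          - targets: ['<hbase-exporter-ip>:9100']
Zabbix (enterprise monitoring):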
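Configure the Zabbix agent by adding a user parameter to zabbix_agentd.conf, for example:
    UserParameter=hbase.regionserver.live,echo "status 'simple'" | /usr/bin/hbase shell -n | grep -c LIVE
The hbase shell reads commands from standard input, so the command is piped in rather than passed as an argument; adjust the grep pattern to match the status output of your HBase version. Then, in the Zabbix web UI, create an "HBase Cluster" template, link it to the RegionServer nodes, and define items (such as hbase.regionserver.requests) and triggers (for example, raise an alert when requests exceed 1000 per second).
Nagios (traditional monitoring):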
Use plugins such as check_hbase.py to monitor HBase service availability (for example, whether the HMaster and RegionServer processes are running) and RegionServer load (for example, read/write latency). Place the plugin in /usr/local/nagios/libexec/ and add a service check to services.cfg:
    define service {
        use                  generic-service
        host_name            hbase-master
        service_description  HBase Master
        check_command        check_hbase_master!/usr/local/nagios/libexec/check_hbase.py
    }
For Prometheus alerting, create a rule file hbase_rules.yml that defines the trigger conditions (for example, an HBase node going down or the Compaction queue growing too long). Example:
    groups:
      - name: hbase_alerts
        rules:
          - alert: HBaseNodeDown
            expr: up{job="hbase"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "HBase node {{ $labels.instance }} is down"
              description: "HBase node {{ $labels.instance }} has been down for more than 1 minute."
          - alert: CompactionQueueTooLong
            expr: hbase_regionserver_compaction_queue_length > 100
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Compaction queue too long on {{ $labels.instance }}"
              description: "Compaction queue length is {{ $value }} on {{ $labels.instance }}, exceeding threshold of 100."
Add the rule file path to prometheus.yml:
    rule_files:
      - "rules/hbase_rules.yml"
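In alertmanager.yml, configure the notification channel (such as email or Slack). Inside email_configs the SMTP fields are named smarthost, auth_username, and auth_password; the smtp_-prefixed names are only valid in the global section:
    route:
      receiver: 'email-notifications'
    receivers:
      - name: 'email-notifications'
        email_configs:
          - to: 'admin@example.com'
            from: 'alertmanager@example.com'
            smarthost: 'smtp.example.com:587'
            auth_username: 'user@example.com'
            auth_password: 'password'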
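The for: field in the alert rules above is easy to misread, so here is a minimal Python sketch of how such a clause behaves: the condition must hold continuously for the whole duration before the alert moves from pending to firing, and a single sample back under the threshold resets it. The ForClause class and the sample values are hypothetical illustrations, not Prometheus code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForClause:
    """Simplified model of one alert rule with a `for:` duration."""
    threshold: float
    hold_seconds: float
    active_since: Optional[float] = None  # when the condition first became true

    def observe(self, now: float, value: float) -> str:
        """Feed one evaluation (timestamp in seconds, metric value); return the state."""
        if value <= self.threshold:
            # Condition is false: any pending or firing state is cleared.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.hold_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    # Mirrors CompactionQueueTooLong: value > 100 held for 5 minutes.
    rule = ForClause(threshold=100, hold_seconds=300)
    # Hypothetical (timestamp, queue length) samples.
    for t, v in [(0, 120), (60, 130), (120, 90), (180, 150), (480, 160)]:
        print(t, v, rule.observe(t, v))
```

In this run the dip to 90 at t=120 resets the timer, so the alert only fires at t=480, five minutes after the condition became true again at t=180.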
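Zabbix: in the web UI, define the alert condition (for example {hbase-regionserver.requests.avg(5m)}>1000), choose the notification method (email, SMS, or WeCom), and associate the corresponding user group.
Nagios: define the notification command in commands.cfg. The command line must not be wrapped in quotes as a whole, otherwise the shell treats it as a single command name:
    define command {
        command_name  notify-by-email
        command_line  /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nService: $SERVICEDESC$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n$SERVICEOUTPUT$\n" | /usr/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $SERVICESTATE$ **" $CONTACTEMAIL$
    }
Then add the contact's email address in contacts.cfg so that an email is sent automatically when an alert fires.
Log monitoring: collect HBase logs (such as hbase-regionserver.log) and trigger alerts on keywords such as "ERROR" and "Exception" to locate problems quickly.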
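The keyword-based log alerting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the alert threshold, and the sample log lines are hypothetical, and a production setup would normally rely on a log shipper rather than a hand-written scanner.

```python
import re
from collections import Counter
from typing import Iterable

# Keywords come from the text; matching without word boundaries is a choice
# made here so that names like IOException also count as "Exception".
KEYWORDS = re.compile(r"ERROR|Exception")


def count_keyword_hits(lines: Iterable[str]) -> Counter:
    """Count how many times each keyword occurs across the given log lines."""
    hits: Counter = Counter()
    for line in lines:
        for match in KEYWORDS.findall(line):
            hits[match] += 1
    return hits


def should_alert(hits: Counter, max_hits: int = 0) -> bool:
    """Alert when the total number of keyword hits exceeds max_hits."""
    return sum(hits.values()) > max_hits


if __name__ == "__main__":
    # Hypothetical excerpt in the style of hbase-regionserver.log.
    sample = [
        "2024-01-01 12:00:00,000 INFO  [main] regionserver.HRegionServer: started",
        "2024-01-01 12:00:01,000 ERROR [rs] regionserver.HRegion: failed flush",
        "java.io.IOException: disk full",
    ]
    hits = count_keyword_hits(sample)
    print(dict(hits), should_alert(hits))
```

To scan a real file, pass an open file object (for example `open("hbase-regionserver.log")`) as `lines`; the path depends on the local log directory.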