1. Prerequisites
Install the JDK: sudo yum install java-1.8.0-openjdk-devel, then set the JAVA_HOME environment variable (in /etc/profile.d/java.sh) and source the file to apply it. Download Hadoop: wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz, extract it to /opt/, then set the HADOOP_HOME environment variable (in /etc/profile.d/hadoop.sh) and apply it the same way. Set up passwordless SSH: generate a key pair with ssh-keygen -t rsa and append the public key (id_rsa.pub) to ~/.ssh/authorized_keys on every node, so the cluster can be managed without password prompts.
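A minimal sketch of the two profile scripts; the JAVA_HOME path is the usual location of the yum-installed OpenJDK and the HADOOP_HOME path assumes the tarball was extracted to /opt/hadoop-3.3.1, so verify both on your hosts:

# /etc/profile.d/java.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk    # default yum install location (verify with: readlink -f $(which java))
export PATH=$PATH:$JAVA_HOME/bin

# /etc/profile.d/hadoop.sh
export HADOOP_HOME=/opt/hadoop-3.3.1                # where the tarball was extracted
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# apply in the current shell without logging out
source /etc/profile.d/java.sh
source /etc/profile.d/hadoop.sh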
2. Configure JournalNode (Shared Edit Log Storage)
JournalNodes store the NameNode edits log (the operation log), keeping the metadata of the Active and Standby NameNodes in sync.
Add the JournalNode configuration to hdfs-site.xml:
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/var/hadoop/hdfs/journal</value> <!-- edits storage path; create it in advance and grant ownership (chown -R hadoop:hadoop /var/hadoop/hdfs/journal) -->
</property>
Start the JournalNode with hadoop-daemon.sh start journalnode and confirm the JournalNode process is running with jps; a per-host sketch follows below.
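A sketch of the complete per-host JournalNode setup, assuming the daemons run as a hadoop user (matching the chown above):

# run on journalnode1, journalnode2, and journalnode3
sudo mkdir -p /var/hadoop/hdfs/journal
sudo chown -R hadoop:hadoop /var/hadoop/hdfs/journal
hadoop-daemon.sh start journalnode     # Hadoop 3.x equivalent: hdfs --daemon start journalnode
jps | grep JournalNode                 # a JournalNode process should be listed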
3. Configure ZooKeeper (Failover Coordination)
ZooKeeper monitors NameNode state and drives automatic failover through the ZKFC (ZooKeeper Failover Controller).
Create the data directory with mkdir -p /var/lib/zookeeper, then edit zoo.cfg (/opt/zookeeper/conf/zoo.cfg):
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
On each node, create a myid file under dataDir whose content is that node's ID (e.g., write 1 on zoo1). Start each server with zkServer.sh start and check the cluster roles with zkServer.sh status; one node should report leader and the rest follower.
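A sketch of the per-node steps, assuming ZooKeeper is installed under /opt/zookeeper as in the zoo.cfg path above (substitute the matching ID on each host):

# on zoo1 (use 2 on zoo2 and 3 on zoo3, matching the server.N entries in zoo.cfg)
echo 1 > /var/lib/zookeeper/myid
/opt/zookeeper/bin/zkServer.sh start
/opt/zookeeper/bin/zkServer.sh status   # expect Mode: leader on one node, Mode: follower on the others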
4. Configure HDFS High Availability (NameNode and ZKFC)
core-site.xml (global configuration):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value> <!-- logical cluster name; must match dfs.nameservices in hdfs-site.xml -->
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zoo1:2181,zoo2:2181,zoo3:2181</value> <!-- ZooKeeper ensemble addresses -->
</property>
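As a quick sanity check that the configuration is being picked up, hdfs getconf prints the resolved value of a key (run on any node with the environment from step 1):

hdfs getconf -confKey fs.defaultFS          # expect: hdfs://mycluster
hdfs getconf -confKey ha.zookeeper.quorum   # expect: zoo1:2181,zoo2:2181,zoo3:2181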
hdfs-site.xml (core HDFS HA configuration):
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value> <!-- logical cluster name -->
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value> <!-- NameNode identifiers (comma-separated) -->
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1:8020</value> <!-- RPC address of nn1 (initially Active) -->
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2:8020</value> <!-- RPC address of nn2 (initially Standby) -->
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn1</name>
  <value>namenode1:50070</value> <!-- HTTP address of nn1; 50070 follows the Hadoop 2.x convention (3.x defaults to 9870) -->
</property>
<property>
  <name>dfs.namenode.http-address.mycluster.nn2</name>
  <value>namenode2:50070</value> <!-- HTTP address of nn2 -->
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> <!-- client-side failover proxy -->
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value> <!-- fencing method during failover (prevents split-brain) -->
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value> <!-- SSH private key used for fencing -->
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://journalnode1:8485;journalnode2:8485;journalnode3:8485/mycluster</value> <!-- shared edits URI pointing at the JournalNodes -->
</property>
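One property the listing above omits: automatic failover must be enabled explicitly, or the ZKFC daemons started in the next step will refuse to run. Add to hdfs-site.xml:

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value> <!-- required for ZKFC-driven automatic failover -->
</property>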
Bring the cluster up in the following order. Note that the JournalNodes must be running before the NameNode is formatted, because formatting writes the initial metadata to the shared edits directory (a host-by-host sketch follows the list):
1. Start the JournalNodes on all journal hosts (hadoop-daemon.sh start journalnode) and confirm they are healthy.
2. On nn1, initialize the metadata: hdfs namenode -format.
3. Start nn1 (hadoop-daemon.sh start namenode); then, on nn2, copy the metadata over with hdfs namenode -bootstrapStandby and start that NameNode as well.
4. Initialize the HA state in ZooKeeper (once, from either NameNode host): hdfs zkfc -formatZK.
5. Start a ZKFC on both NameNode hosts (hadoop-daemon.sh start zkfc); the ZKFCs monitor NameNode state and trigger failover automatically.
6. Start the DataNodes (hadoop-daemon.sh start datanode); they register with both NameNodes.
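The same sequence consolidated, using the hostnames from the configuration above (namenode1 hosts nn1, namenode2 hosts nn2):

# on journalnode1, journalnode2, journalnode3
hadoop-daemon.sh start journalnode

# on namenode1 (nn1) -- first-time setup only; -format destroys existing metadata
hdfs namenode -format
hadoop-daemon.sh start namenode

# on namenode2 (nn2)
hdfs namenode -bootstrapStandby        # copy the formatted metadata from nn1
hadoop-daemon.sh start namenode

# once, on either NameNode host
hdfs zkfc -formatZK                    # create the HA znode in ZooKeeper

# on both NameNode hosts
hadoop-daemon.sh start zkfc

# on every DataNode host
hadoop-daemon.sh start datanode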
5. Verify High Availability
Check the NameNode roles with hdfs haadmin -getServiceState nn1 and hdfs haadmin -getServiceState nn2 (one should report active, the other standby); hdfs dfsadmin -report shows overall cluster health. Simulate a failure by stopping the Active NameNode (hadoop-daemon.sh stop namenode) and confirm that the Standby becomes Active (check with hdfs haadmin -getServiceState or the web UI at http://namenode1:50070). Finally, upload a file with hdfs dfs -put, stop the Active NameNode, and upload again to verify that the client automatically fails over to the new Active NameNode.
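A sketch of that test from a client node, assuming nn1 starts out Active:

hdfs haadmin -getServiceState nn1        # expect: active
hdfs haadmin -getServiceState nn2        # expect: standby

echo test > /tmp/ha-test.txt
hdfs dfs -put /tmp/ha-test.txt /tmp/     # write while nn1 is Active

# on namenode1: stop the Active NameNode to force a failover
hadoop-daemon.sh stop namenode

hdfs haadmin -getServiceState nn2        # expect: active (promoted by ZKFC)
hdfs dfs -put -f /tmp/ha-test.txt /tmp/  # the client retries against nn2 transparently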
6. Ongoing Maintenance
Regularly back up the NameNode metadata (e.g., with hdfs dfsadmin -fetchImage) and the JournalNode data to guard against data loss. For security, enable Kerberos authentication (clients obtain tickets with the kinit command), set HDFS permissions (hdfs dfs -chmod), and restrict firewall ports, opening only those required (e.g., 8020, 50070, 2181).
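A sketch of a periodic metadata backup plus the firewall rules, assuming firewalld on the CentOS hosts from step 1 and a hypothetical /backup/hdfs directory for the images:

# download the latest fsimage from the active NameNode to a local backup directory
mkdir -p /backup/hdfs
hdfs dfsadmin -fetchImage /backup/hdfs

# open only the required ports (run on the hosts that serve each role)
sudo firewall-cmd --permanent --add-port=8020/tcp    # NameNode RPC
sudo firewall-cmd --permanent --add-port=50070/tcp   # NameNode HTTP (as configured above)
sudo firewall-cmd --permanent --add-port=2181/tcp    # ZooKeeper client port
sudo firewall-cmd --permanent --add-port=8485/tcp    # JournalNode RPC (from the qjournal URI)
sudo firewall-cmd --reload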