A Complete Path to Big Data Analytics with HDFS on Debian
1. Architecture and Preparation
2. Deploying Hadoop and HDFS on Debian
Install OpenJDK 8, verify it, then download and unpack Hadoop:

sudo apt update && sudo apt install -y openjdk-8-jdk
java -version
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
sudo tar -xzvf hadoop-3.3.1.tar.gz -C /usr/local/
sudo ln -s /usr/local/hadoop-3.3.1 /usr/local/hadoop

Add the following to ~/.bashrc:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Apply it with source ~/.bashrc. Next, edit the configuration files under $HADOOP_HOME/etc/hadoop. core-site.xml:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode:9000</value>
</property>
</configuration>
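The fs.defaultFS value is the URI every HDFS client uses to locate the NameNode's RPC endpoint. As a quick sketch of how such a URI decomposes (plain Python; the host namenode and port 9000 are simply the values configured above):

```python
from urllib.parse import urlparse

# The value configured above: hdfs:// scheme, NameNode host, RPC port 9000.
u = urlparse("hdfs://namenode:9000")
print(u.scheme, u.hostname, u.port)  # hdfs namenode 9000
```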
hdfs-site.xml:

<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/dfs/data</value>
</property>
</configuration>
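For intuition on what dfs.replication = 3 means for capacity planning: each block of a file is stored on three DataNodes, so raw disk consumed is roughly three times the logical data size. A back-of-the-envelope sketch, assuming the Hadoop 3.x default block size of 128 MB (dfs.blocksize):

```python
import math

BLOCK_SIZE_MB = 128   # Hadoop 3.x default dfs.blocksize
REPLICATION = 3       # dfs.replication configured above

def blocks_for(file_size_mb: float) -> int:
    """Number of HDFS blocks one file of this size occupies."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def raw_usage_mb(file_size_mb: float) -> float:
    """Raw capacity consumed across the cluster, all replicas included."""
    return file_size_mb * REPLICATION

print(blocks_for(1000))    # a 1000 MB file spans 8 blocks
print(raw_usage_mb(1000))  # and consumes 3000 MB of raw disk
```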
mapred-site.xml:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
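All four files above share the same Hadoop configuration schema: a `<configuration>` root holding `<property>` elements with `<name>`/`<value>` children. A minimal sketch of reading such a file with the Python standard library (an illustrative helper, not Hadoop's own Configuration class):

```python
import xml.etree.ElementTree as ET

def read_hadoop_conf(xml_text: str) -> dict:
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}

core_site = """\
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>"""

print(read_hadoop_conf(core_site))  # {'fs.defaultFS': 'hdfs://namenode:9000'}
```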
Map each node's IP to its hostname in /etc/hosts. Then format the NameNode and start the daemons:

hdfs namenode -format
start-dfs.sh
start-yarn.sh
jps    (you should see NameNode/DataNode/ResourceManager/NodeManager, etc.)

Basic HDFS operations:

hdfs dfs -mkdir -p /data/input
hdfs dfs -put local.txt /data/input/
hdfs dfs -ls /data/input

3. Running Example Jobs and Submission Methods
hdfs dfs -mkdir -p /data/input && hdfs dfs -put $HADOOP_HOME/README.txt /data/input/
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /data/input /data/output
hdfs dfs -cat /data/output/part-r-00000

For Spark on YARN, unpack Spark to /usr/local/spark, set SPARK_HOME and PATH, then submit:

$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
$SPARK_HOME/examples/src/main/python/wordcount.py \
hdfs://namenode:9000/data/input hdfs://namenode:9000/data/spark-out
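What the wordcount example actually computes can be imitated locally: the map phase emits (word, 1) pairs, the shuffle groups them by key, and the reduce phase sums each group; every line of part-r-00000 then has the form word<TAB>count. A plain-Python sketch of that logic (not the real MapReduce runtime):

```python
from collections import Counter

def wordcount(lines):
    """Count whitespace-delimited tokens, mimicking the example job."""
    counts = Counter()
    for line in lines:               # map phase: tokenize each input line
        counts.update(line.split())  # shuffle + reduce folded into Counter
    # reducer output, one "word\tcount" line per key, sorted as in part-r-00000
    return [f"{word}\t{n}" for word, n in sorted(counts.items())]

print(wordcount(["hello hdfs", "hello yarn"]))
# ['hdfs\t1', 'hello\t2', 'yarn\t1']
```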
4. Performance and Resource Management Tuning
5. Common Issues and Troubleshooting
- Nodes cannot reach each other: check /etc/hosts, the firewall, and passwordless SSH configuration.
- Daemons not running: confirm the processes with jps, and check connectivity on ports 9870 (NameNode web UI) and 8088 (ResourceManager web UI).
- NameNode or DataNode fails to start: re-run hdfs namenode -format (note that this wipes HDFS metadata), and check permissions and free disk space for the dfs.namenode.name.dir and dfs.datanode.data.dir directories.
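The port checks above can be scripted; a minimal sketch using only the Python standard library (run it on, or against, the NameNode/ResourceManager host):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or unresolvable
        return False

# 9870 = NameNode web UI, 8088 = ResourceManager web UI; replace 127.0.0.1
# with a node's hostname to test connectivity from another machine.
for p in (9870, 8088):
    print(p, "open" if port_open("127.0.0.1", p) else "closed")
```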