Running a Hadoop job on Ubuntu involves several steps: installing and configuring the Hadoop environment, writing and submitting the job, and monitoring its execution. The following is a step-by-step guide.
sudo apt-get update
sudo apt-get install openjdk-8-jdk
java -version
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
sudo tar -zxvf hadoop-3.3.4.tar.gz -C /opt/
sudo mv /opt/hadoop-3.3.4 /opt/hadoop
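Because the archive was extracted with sudo, everything under /opt/hadoop belongs to root; handing it to your own account avoids permission errors later when Hadoop writes logs and pid files there:
sudo chown -R $(whoami) /opt/hadoop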
echo 'export HADOOP_HOME=/opt/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
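At this point the Hadoop binaries should be on the PATH, which can be confirmed with:
hadoop version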
Before starting the services, edit the configuration files under /opt/hadoop/etc/hadoop/:
hadoop-env.sh: set the JAVA_HOME environment variable.
core-site.xml: set the default filesystem URI (fs.defaultFS).
hdfs-site.xml: configure HDFS parameters such as the block replication factor.
yarn-site.xml: configure YARN parameters.
mapred-site.xml: configure MapReduce parameters (a minimal single-node sketch of these files follows below).
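The exact values depend on your cluster. As a sketch for a single-node (pseudo-distributed) setup, the files could be written as follows; the JAVA_HOME path (valid for the openjdk-8-jdk package on amd64 Ubuntu), the address hdfs://localhost:9000, the replication factor of 1, and the HADOOP_MAPRED_HOME value are all assumptions for a one-machine cluster, not required values:

# Tell Hadoop where the JDK lives (adjust the path to your installation)
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> /opt/hadoop/etc/hadoop/hadoop-env.sh

cat > /opt/hadoop/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <!-- URI of the default filesystem; localhost works for a single node -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat > /opt/hadoop/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <!-- One DataNode means one copy of each block is enough -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

cat > /opt/hadoop/etc/hadoop/yarn-site.xml <<'EOF'
<configuration>
  <!-- Auxiliary shuffle service that MapReduce jobs require -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

cat > /opt/hadoop/etc/hadoop/mapred-site.xml <<'EOF'
<configuration>
  <!-- Run jobs on YARN rather than in a single local JVM -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Hadoop 3.x tasks need to know where the MapReduce jars live -->
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
</configuration>
EOF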
Then format the NameNode (only before the very first start; reformatting wipes HDFS metadata):
hdfs namenode -format
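start-dfs.sh and start-yarn.sh launch the daemons over SSH, even on a single machine, so passwordless SSH to localhost has to work first; the commands below use the default key path:

sudo apt-get install openssh-server
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys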
start-dfs.sh
start-yarn.sh
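If the daemons came up, jps (bundled with the JDK) should list them; a missing process usually points to a configuration problem, and the logs under /opt/hadoop/logs say why:

jps
# Expect on a single node: NameNode, DataNode, SecondaryNameNode,
# ResourceManager, NodeManager (plus Jps itself)

The HDFS web UI is then reachable at http://localhost:9870.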
Hadoop jobs are usually written in Java, but other supported languages such as Python can also be used via Hadoop Streaming. Below is a simple MapReduce word-count example:
Mapper (mapper.py):
#!/usr/bin/env python3
import sys

# Emit one "word<TAB>1" pair per word; tab is the streaming default key/value separator.
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
Reducer (reducer.py):
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Hadoop sorts the mapper output by key, so all counts for a word arrive together.
for line in sys.stdin:
    word, count = line.strip().split('\t')
    if current_word == word:
        current_count += int(count)
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = int(count)

# Emit the final word once the input is exhausted.
if current_word:
    print(f"{current_word}\t{current_count}")
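Because streaming scripts just read stdin and write stdout, the whole pipeline can be tried locally before touching the cluster; here sort stands in for Hadoop's shuffle phase, and words.txt is a hypothetical sample text file:

chmod +x mapper.py reducer.py
cat words.txt | ./mapper.py | sort | ./reducer.py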
Submit the job with the Hadoop Streaming jar; the -files option ships both scripts to the cluster nodes:

hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.4.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input input_directory \
    -output output_directory
Here input_directory is the HDFS directory holding the data to process and output_directory is the HDFS directory for the results; the output directory must not exist yet, or Hadoop refuses to start the job.
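The input has to be uploaded to HDFS first, and the finished job leaves its results in part files; words.txt below is the same hypothetical sample file as above:

hdfs dfs -mkdir -p input_directory
hdfs dfs -put words.txt input_directory/
# ...after the job finishes:
hdfs dfs -cat output_directory/part-*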
Use the YARN ResourceManager web UI (http://localhost:8088) to monitor the job's status and performance.
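The same information is available from the command line, which helps on headless servers. yarn application -list shows running jobs; fetching logs with yarn logs additionally assumes log aggregation (yarn.log-aggregation-enable) is turned on, and <application_id> is a placeholder for the ID printed when the job was submitted:

yarn application -list
yarn logs -applicationId <application_id>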
With the steps above you can run Hadoop jobs on Ubuntu. Adjust the configuration and the job code to your specific needs and environment.