Hadoop Development Tools and Environment Configuration on Debian
Hadoop development on Debian starts with a working Hadoop runtime environment; various tools can then be layered on top to improve development efficiency. The core tools and their configuration are described below.
Hadoop depends on a Java runtime and on passwordless SSH login, both of which can be configured with the following commands.
Install the JDK and set the JAVA_HOME environment variable:
sudo apt update && sudo apt install -y openjdk-11-jdk
echo "export JAVA_HOME=$(readlink -f /usr/bin/javac | sed 's:/bin/javac::')" >> ~/.bashrc
echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> ~/.bashrc
source ~/.bashrc
Set up passwordless SSH for the local node:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
These settings are the prerequisites for running Hadoop: they ensure the system can run Java and that nodes can communicate with each other without password prompts.
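As a quick sanity check, the following small Python script (an illustrative sketch, not part of the original setup; the file name check_env.py is made up here) confirms that JAVA_HOME is set, that java runs, and that key-based SSH to localhost works without prompting:
#!/usr/bin/env python3
# check_env.py - illustrative helper (not from the original article) to verify the Hadoop prerequisites
import os
import subprocess

java_home = os.environ.get("JAVA_HOME", "")
print("JAVA_HOME =", java_home or "(not set)")
# a zero exit code from 'java -version' means the JDK is on the PATH and usable
java_ok = subprocess.run(["java", "-version"], capture_output=True).returncode == 0
# BatchMode=yes makes ssh fail instead of prompting, so a zero exit code means key-based login works
ssh_ok = subprocess.run(["ssh", "-o", "BatchMode=yes", "localhost", "true"],
                        capture_output=True).returncode == 0
print("java runnable:", java_ok)
print("passwordless ssh to localhost:", ssh_ok)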
Hadoop ships with a set of command-line tools for managing the cluster, operating on HDFS, and submitting MapReduce jobs:
The hdfs dfs command handles file uploads (-put), downloads (-get), deletion (-rm), and so on. For example, to upload a local file to HDFS: hdfs dfs -put local_file.txt /user/hadoop/input/
The mapred command submits and manages MapReduce jobs. For example, to submit a Streaming job written in Python: mapred streaming -input /user/hadoop/input/ -output /user/hadoop/output/ -mapper "python mapper.py" -reducer "python reducer.py"
The yarn command checks application status (application -list), kills applications (application -kill), and more. For example, to list the YARN applications currently running: yarn application -list
These commands form the low-level operational interface for Hadoop development and lend themselves to scripted administration and debugging, as sketched below.
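To illustrate the scripted-management point, here is a minimal Python sketch (the script name hdfs_upload.py and its arguments are hypothetical) that wraps hdfs dfs with subprocess to upload a file and confirm the result:
#!/usr/bin/env python3
# hdfs_upload.py - hypothetical example of driving 'hdfs dfs' from a script
import subprocess
import sys

def hdfs(*args):
    # run an 'hdfs dfs' subcommand and raise CalledProcessError if it fails
    subprocess.run(["hdfs", "dfs", *args], check=True)

if __name__ == "__main__":
    local_file, hdfs_dir = sys.argv[1], sys.argv[2]   # e.g. local_file.txt /user/hadoop/input/
    hdfs("-mkdir", "-p", hdfs_dir)                    # create the target directory if it is missing
    hdfs("-put", "-f", local_file, hdfs_dir)          # -f overwrites an existing copy
    hdfs("-ls", hdfs_dir)                             # list the directory to confirm the upload
Run it as python3 hdfs_upload.py local_file.txt /user/hadoop/input/; check=True makes a failed HDFS command abort the script instead of failing silently.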
Hadoop also provides several development toolkits that support writing distributed applications in languages other than Java:
Hadoop Streaming: hadoop-streaming.jar (located in $HADOOP_HOME/share/hadoop/tools/lib/) lets any executable that reads stdin and writes stdout act as the mapper or reducer. For example, to submit a Python MapReduce job (a minimal mapper.py/reducer.py pair is sketched after this command):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming.jar \
-input /input/data.txt -output /output/result \
-mapper "python mapper.py" -reducer "python reducer.py"
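The mapper.py and reducer.py referenced above are not shown in the original text; a minimal word-count pair, assuming the standard Streaming contract (read lines from stdin, write tab-separated key/value pairs to stdout, reducer input arrives sorted by key), might look like this:
#!/usr/bin/env python3
# mapper.py - minimal word-count mapper for Hadoop Streaming (illustrative sketch)
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Streaming expects one tab-separated key/value pair per output line
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - minimal word-count reducer; Streaming delivers the mapper output sorted by key
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
If the scripts are not already present on every node, they are typically shipped with the job via the Streaming -files option.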
Hadoop Pipes (the C++ interface): subclass HadoopPipes::Mapper and HadoopPipes::Reducer and implement the map and reduce functions, for example:
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
class MyMapper : public HadoopPipes::Mapper {
public:
  MyMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::string line = context.getInputValue();
    // processing logic: here each input line is emitted as a key with the count "1"
    context.emit(line, "1");
  }
};
class MyReducer : public HadoopPipes::Reducer {
public:
  MyReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue()) sum += HadoopUtils::toInt(context.getInputValue());
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};
int main(int argc, char** argv) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MyMapper, MyReducer>());
}
These toolkits extend Hadoop's development capabilities to distributed applications written in languages other than Java.
IDE plugins can speed up writing and debugging Hadoop code. Commonly used plugins include:
The Hadoop Eclipse Plugin supports HDFS file browsing, MapReduce job debugging, and cluster configuration management; a Hadoop Support plugin adds code completion, syntax checking, and remote debugging for Hadoop code; and a Hadoop editor extension provides HDFS file operations, YARN job management, and log viewing.
Beyond these core tools, other components of the Hadoop ecosystem can also support development:
Hive runs SQL-like queries over structured data in HDFS, for example:
CREATE TABLE logs (ip STRING, time STRING, url STRING);
LOAD DATA INPATH '/input/logs.txt' INTO TABLE logs;
SELECT ip, COUNT(*) FROM logs GROUP BY ip;
Pig expresses the same kind of analysis as a data-flow script in Pig Latin, for example:
logs = LOAD '/input/logs.txt' AS (ip:chararray, time:chararray, url:chararray);
grouped = GROUP logs BY ip;
counts = FOREACH grouped GENERATE group, COUNT(logs);
STORE counts INTO '/output/ip_counts';
Sqoop transfers data between relational databases and HDFS, for example importing a MySQL table:
sqoop import --connect jdbc:mysql://localhost/mydb --table logs --target-dir /input/logs
These tools lower the complexity of Hadoop development and are well suited to data-warehousing, ETL, and similar scenarios.