# Hadoop Setup and a WordCount Example Walkthrough
## 1. Hadoop Overview
### 1.1 Introduction to Hadoop
Hadoop is a distributed-systems foundation developed by the Apache Foundation. Its core design includes:
- **HDFS** (Hadoop Distributed File System): distributed file storage
- **MapReduce**: distributed computation framework
- **YARN**: resource scheduling and management
### 1.2 Core Strengths
- High fault tolerance: multiple data replicas are maintained automatically
- High scalability: deployable on inexpensive commodity hardware
- High efficiency: processes petabyte-scale data in parallel
- High reliability: automatic failover
## 2. Setting Up the Hadoop Environment
### 2.1 Prerequisites
**Hardware requirements:**
- At least 3 nodes (1 master, 2 workers)
- 4 GB RAM and 50 GB disk per node
- Gigabit network connectivity
**Software requirements:**
- JDK 1.8+
- Passwordless SSH login
- Linux (CentOS or Ubuntu recommended)
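Passwordless SSH from the master to every node can be set up along these lines (a sketch; the `slave1`/`slave2` hostnames are the ones configured in the next section):

```shell
# Generate a key pair on the master if one does not exist yet
mkdir -p "$HOME/.ssh"
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
# Authorize the key locally; run ssh-copy-id for the other nodes
cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
# ssh-copy-id slave1 && ssh-copy-id slave2
```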
### 2.2 Installation Steps
#### 2.2.1 System Configuration
```bash
# Disable the firewall (on every node)
systemctl stop firewalld
systemctl disable firewalld

# Set the hostname (run the matching command on each node)
hostnamectl set-hostname master   # on the master node
hostnamectl set-hostname slave1   # on worker node 1
hostnamectl set-hostname slave2   # on worker node 2

# Configure the hosts file (on every node)
echo "192.168.1.100 master
192.168.1.101 slave1
192.168.1.102 slave2" >> /etc/hosts
```
#### 2.2.2 JDK Installation
```bash
# Unpack the JDK and add it to PATH
tar -zxvf jdk-8u341-linux-x64.tar.gz -C /usr/local/
echo 'export JAVA_HOME=/usr/local/jdk1.8.0_341
export PATH=$PATH:$JAVA_HOME/bin' >> /etc/profile
source /etc/profile
```
#### 2.2.3 Hadoop Installation
```bash
# Download and unpack Hadoop 3.3.4
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -zxvf hadoop-3.3.4.tar.gz -C /usr/local/
mv /usr/local/hadoop-3.3.4 /usr/local/hadoop
```
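Mirroring the JAVA_HOME step, it also helps to export `HADOOP_HOME` and put Hadoop's `bin` and `sbin` directories on the PATH (the path assumes the rename to `/usr/local/hadoop` above; append the two export lines to `/etc/profile` to make them persistent):

```shell
# Make the hadoop/hdfs/yarn and start-*.sh commands resolvable
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
echo "$HADOOP_HOME"
# → /usr/local/hadoop
```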
#### 2.2.4 Configuration Files
Edit the following files under `/usr/local/hadoop/etc/hadoop/`.

`core-site.xml`:
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```
`hdfs-site.xml`:
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/namenode</value>
  </property>
</configuration>
```
`mapred-site.xml`:
```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
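Setting `mapreduce.framework.name` to `yarn` also presumes a working YARN configuration. A minimal `yarn-site.xml` sketch (assuming, as elsewhere in this guide, that the ResourceManager runs on the master node):

```xml
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```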
#### 2.2.5 Starting the Cluster
```bash
# Format HDFS (on the master node, first start only)
hdfs namenode -format
# Start the services
start-dfs.sh
start-yarn.sh
# Verify the services
jps
# Should show: NameNode/DataNode/ResourceManager/NodeManager
```
## 3. The WordCount Example
WordCount is the "Hello World" of MapReduce. The processing flow:
1. InputSplit: split the input file into chunks
2. Map phase: tokenize each line and emit a (word, 1) pair per token
3. Shuffle phase: group the emitted pairs by word
4. Reduce phase: sum the counts for each word
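The flow above can be mimicked with a plain shell pipeline, where `tr` plays the Map phase, `sort` the Shuffle, and `uniq -c` the Reduce (an analogy only, not how Hadoop executes it):

```shell
# Map: one word per line; Shuffle: sort groups identical words together;
# Reduce: uniq -c counts each group
echo "Hello World Hello Hadoop" | tr ' ' '\n' | sort | uniq -c
# → 1 Hadoop / 2 Hello / 1 World
```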
```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper implementation: emit (word, 1) for every token in the input
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer implementation: sum the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
```bash
# Package WordCount into wordcount.jar first, then:
echo "Hello World Hello Hadoop" > input.txt
hdfs dfs -mkdir /input
hdfs dfs -put input.txt /input
hadoop jar wordcount.jar WordCount /input /output
hdfs dfs -cat /output/part-r-00000
# Sample output:
# Hadoop 1
# Hello 2
# World 1
```
## 4. Performance Tuning
```xml
<!-- mapred-site.xml tuning example -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>200</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
```
Suggested cluster resource allocation:

| Component | Share | Notes |
|---|---|---|
| Map tasks | 40-60% | CPU-intensive |
| Reduce tasks | 20-30% | need more network bandwidth |
| System reserve | 10-20% | OS and other services |
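As a quick sanity check of the table, here is the split for a hypothetical 4096 MB node, taking roughly the midpoint of each range (the node size and exact percentages are illustrative assumptions):

```shell
total=4096                        # hypothetical node memory in MB
map=$((total * 50 / 100))         # Map tasks: ~50%
reduce=$((total * 25 / 100))      # Reduce tasks: ~25%
system=$((total - map - reduce))  # remainder for the OS and services
echo "map=${map}MB reduce=${reduce}MB system=${system}MB"
# → map=2048MB reduce=1024MB system=1024MB
```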
## 5. Troubleshooting
Common issues:
1. Passwordless SSH fails: check that `~/.ssh/authorized_keys` has permission 600, then run `ssh master` to confirm that login works without a password.
2. NameNode fails to start: inspect the log at `/usr/local/hadoop/logs/hadoop-*-namenode-*.log`.

```bash
# Check job status
yarn application -list
# Kill a job
yarn application -kill application_123456789_0001
```
## 6. Summary
As the cornerstone of the big-data ecosystem, Hadoop's core value lies in:
1. Enabling large-scale computation on inexpensive hardware
2. Providing a reliable data storage solution
3. Pioneering the distributed computing paradigm

Future trends:
- Convergence with cloud-native technology (Kubernetes scheduling)
- Stronger real-time computing (Flink integration)
- A maturing machine-learning ecosystem (TensorFlow on YARN)

Note: this article was verified against Hadoop 3.3.4; the complete code and configuration files are available in the companion GitHub example repository.