Integrating Hadoop and Spark on Debian involves several steps: installing the required packages, configuring environment variables, editing the configuration files, and starting the services. The steps are detailed below.
Hadoop depends on Java, so install a JDK first.
sudo apt update
sudo apt install openjdk-11-jdk
Verify the Java installation:
java -version
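The JDK path is needed later as JAVA_HOME. On Debian, OpenJDK 11 typically lives under /usr/lib/jvm/java-11-openjdk-amd64; one way to confirm the actual path on your system is:
readlink -f /usr/bin/java | sed 's:/bin/java::'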
Download and extract the Hadoop binary package. For example, to install Hadoop 3.3.6:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum -c hadoop-3.3.6.tar.gz.sha512
sudo mkdir /opt/hadoop
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /opt/hadoop --strip-components 1
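The rest of this guide relies on a few environment variables. A minimal sketch of what to append to ~/.bashrc, assuming the install path above and OpenJDK 11 under /usr/lib/jvm/java-11-openjdk-amd64 (adjust to your actual JDK path):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Reload the shell with source ~/.bashrc, and set the same JAVA_HOME in /opt/hadoop/etc/hadoop/hadoop-env.sh so the Hadoop start scripts can find the JDK.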
Edit the Hadoop configuration files under /opt/hadoop/etc/hadoop:
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/hdfs/namenode</value>
</property>
</configuration>
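The directory referenced by dfs.namenode.name.dir must exist and be writable by the user that runs HDFS; you can create it up front, for example:
sudo mkdir -p /opt/hadoop/hdfs/namenode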
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
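Depending on your setup, MapReduce jobs on YARN may also need the Hadoop jars on their classpath; the Hadoop single-node setup guide adds a property like the following inside the same <configuration> block (shown here assuming the /opt/hadoop layout above):
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>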
yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
</configuration>
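The next step assumes a dedicated hadoop user that owns the installation and that an SSH server is running; one common way to set this up on Debian (the user name hadoop is just a convention):
sudo apt install openssh-server
sudo adduser --disabled-password --gecos "" hadoop
sudo chown -R hadoop:hadoop /opt/hadoop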
Generate an SSH key for the Hadoop user and append the public key to authorized_keys (Hadoop's start scripts use passwordless SSH to localhost):
sudo su - hadoop
ssh-keygen -t rsa -P ''
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Test the SSH connection:
ssh localhost
Format the NameNode:
hdfs namenode -format
Start the Hadoop services:
/opt/hadoop/sbin/start-dfs.sh
/opt/hadoop/sbin/start-yarn.sh
Verify that the Hadoop services are running:
HDFS status:
hdfs dfsadmin -report
YARN ResourceManager status:
curl http://localhost:8088/cluster/scheduler
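You can also list the running Java daemons with jps (shipped with the JDK); on this single-node setup you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager:
jps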
Download and extract the Spark package. For example, to install Spark 3.3.2 (prebuilt for Hadoop 3):
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
sudo tar -xzvf spark-3.3.2-bin-hadoop3.tgz -C /opt
sudo ln -s /opt/spark-3.3.2-bin-hadoop3 /opt/spark
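As with Hadoop, exporting Spark's environment variables keeps the later commands short; a minimal sketch for ~/.bashrc, assuming the symlink above:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin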
Edit the Spark configuration files under /opt/spark/conf:
spark-defaults.conf:
spark.master yarn
spark.executor.memory 4g
spark.driver.memory 4g
spark-env.sh:
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export HADOOP_HOME=/opt/hadoop
export SPARK_DIST_CLASSPATH=$(/opt/hadoop/bin/hadoop classpath)
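Both files live in /opt/spark/conf, and the binary distribution ships them only as templates; if they do not exist yet, create them from the templates before editing:
cd /opt/spark/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh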
Start the Spark standalone daemons (optional when jobs are submitted to YARN, as configured above):
cd /opt/spark
./sbin/start-master.sh
./sbin/start-worker.sh spark://localhost:7077
Verify the Spark services:
Open the Spark Web UI (the standalone master listens on port 8080):
http://localhost:8080
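To verify the Spark-on-YARN integration end to end, you can submit the bundled SparkPi example to YARN (the examples jar name below assumes the default Scala 2.12 build of Spark 3.3.2):
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client $SPARK_HOME/examples/jars/spark-examples_2.12-3.3.2.jar 10
If the job finishes and prints an approximate value of Pi in the driver output, HDFS, YARN and Spark are working together.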
Note that the exact configuration steps may vary with the Hadoop and Spark versions in use; refer to the official documentation for a detailed configuration guide.