Prerequisites
Before integrating Hadoop and Spark on Debian, ensure you have:
- Java (OpenJDK 11): install with sudo apt update && sudo apt install openjdk-11-jdk, and verify with java -version.
- Passwordless SSH for the Hadoop user (e.g., hadoop) to enable cluster communication. Generate keys with ssh-keygen -t rsa and append the public key to ~/.ssh/authorized_keys (for example, cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys), then verify with ssh localhost.
1. Install and Configure Hadoop
Download Hadoop (e.g., 3.3.6) from the Apache website and extract it to /opt:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzvf hadoop-3.3.6.tar.gz -C /opt
ln -s /opt/hadoop-3.3.6 /opt/hadoop # Create a symbolic link for easy access
Set environment variables in /etc/profile (as root):
echo "export HADOOP_HOME=/opt/hadoop" >> /etc/profile
echo "export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin" >> /etc/profile
source /etc/profile
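After reloading the profile, a quick sanity check that the Hadoop binaries are on the PATH:
hadoop version # should print the 3.3.6 build information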
Configure the core Hadoop files in $HADOOP_HOME/etc/hadoop. In core-site.xml, set the default filesystem and a temporary directory:
<configuration>
<property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
<property><name>hadoop.tmp.dir</name><value>/opt/hadoop/tmp</value></property>
</configuration>
In hdfs-site.xml, set the replication factor and the NameNode/DataNode storage directories:
<configuration>
<property><name>dfs.replication</name><value>1</value></property>
<property><name>dfs.namenode.name.dir</name><value>/opt/hadoop/hdfs/namenode</value></property>
<property><name>dfs.datanode.data.dir</name><value>/opt/hadoop/hdfs/datanode</value></property>
</configuration>
In mapred-site.xml, run MapReduce on YARN:
<configuration>
<property><name>mapreduce.framework.name</name><value>yarn</value></property>
</configuration>
In yarn-site.xml, enable the MapReduce shuffle service and set the ResourceManager host:
<configuration>
<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
<property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
<property><name>yarn.resourcemanager.hostname</name><value>localhost</value></property>
</configuration>
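Hadoop's start scripts also need JAVA_HOME set in $HADOOP_HOME/etc/hadoop/hadoop-env.sh. A minimal sketch, assuming the default Debian OpenJDK 11 path (adjust to your installation):
echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh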
Format HDFS (only once) and start services:
hdfs namenode -format
start-dfs.sh # Start HDFS
start-yarn.sh # Start YARN
Verify with hdfs dfsadmin -report (check DataNodes) and yarn node -list (check NodeManagers).
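As a further end-to-end check, create your HDFS home directory and round-trip a small file (the paths below are illustrative and assume you are running as the hadoop user):
hdfs dfs -mkdir -p /user/hadoop
echo "hello hadoop spark" > test.txt
hdfs dfs -put test.txt /user/hadoop/
hdfs dfs -cat /user/hadoop/test.txt
The NameNode web UI at http://localhost:9870 and the YARN ResourceManager UI at http://localhost:8088 should also be reachable.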
2. Install and Configure Spark
Download Spark (e.g., 3.3.2) pre-built for Hadoop (e.g., spark-3.3.2-bin-hadoop3.tgz) and extract it to /opt:
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar -xzvf spark-3.3.2-bin-hadoop3.tgz -C /opt
ln -s /opt/spark-3.3.2-bin-hadoop3 /opt/spark # Symbolic link
Set environment variables in /etc/profile (as root):
echo "export SPARK_HOME=/opt/spark" >> /etc/profile
echo "export PATH=\$PATH:\$SPARK_HOME/bin:\$SPARK_HOME/sbin" >> /etc/profile
source /etc/profile
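As with Hadoop, a quick check that Spark is on the PATH:
spark-submit --version # should report Spark 3.3.2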
Configure Spark to integrate with Hadoop:
In $SPARK_HOME/conf/spark-env.sh, point Spark at Hadoop's configuration directory (HADOOP_CONF_DIR) and add Hadoop's libraries to Spark's classpath (SPARK_DIST_CLASSPATH):
echo "export HADOOP_CONF_DIR=\$HADOOP_HOME/etc/hadoop" >> \$SPARK_HOME/conf/spark-env.sh
echo "export SPARK_DIST_CLASSPATH=\$(\$HADOOP_HOME/bin/hadoop classpath)" >> \$SPARK_HOME/conf/spark-env.sh
In $SPARK_HOME/conf/spark-defaults.conf, use YARN as the default master and enable event logging to HDFS:
spark.master yarn
spark.hadoop.fs.defaultFS hdfs://localhost:9000
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/spark-logs
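Spark will not create the event log directory itself, so create it in HDFS before submitting jobs:
hdfs dfs -mkdir -p /spark-logs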
Optionally, start Spark's standalone master and worker (not required when submitting to YARN, but useful for standalone mode):
start-master.sh # Start Spark Master (accessible at http://localhost:8080)
start-worker.sh spark://localhost:7077 # Start Spark Worker (the script was named start-slave.sh before Spark 3.1)
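At this point jps (shipped with the JDK) should list the running daemons, roughly:
jps
# Expect (PIDs omitted): NameNode, DataNode, SecondaryNameNode,
# ResourceManager, NodeManager, Master, Worker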
3. Integrate Hadoop and Spark
The key to integration is ensuring Spark can access Hadoop’s resources (HDFS, YARN). The above configurations achieve this by:
- Pointing Spark at Hadoop's configuration directory (HADOOP_CONF_DIR).
- Adding Hadoop's libraries to Spark's classpath (SPARK_DIST_CLASSPATH).
- Using YARN as the resource manager (spark.master yarn) and HDFS as the default filesystem (spark.hadoop.fs.defaultFS).
To validate the integration, run a Spark job that reads/writes data from HDFS:
# Smoke test: run the bundled SparkPi example
/opt/spark/bin/run-example SparkPi 10
# Submit the same example to YARN explicitly
/opt/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
/opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar 10
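To exercise HDFS as well, the bundled JavaWordCount example can read the file uploaded during the earlier smoke test (the path /user/hadoop/test.txt is illustrative; use whatever file you put in HDFS):
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master yarn \
  --deploy-mode client \
  /opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
  hdfs://localhost:9000/user/hadoop/test.txt
The job runs on YARN, reads the file from HDFS, and prints per-word counts to the console.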
Troubleshooting Tips
- Check the logs in $HADOOP_HOME/logs or $SPARK_HOME/logs.
- Make sure the Spark build matches your Hadoop version (e.g., spark-3.3.2-bin-hadoop3 for Hadoop 3.x).
- Ensure HDFS directories (e.g., /user/hadoop) exist and have the correct permissions.
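For failures on the YARN side, listing applications and pulling their logs is often the fastest route (the application ID is a placeholder; yarn logs requires log aggregation to be enabled):
yarn application -list
yarn logs -applicationId <application_id>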