Prerequisites
Before integrating Hadoop and Spark on Debian, ensure you have:
- Java (OpenJDK 11): install with sudo apt update && sudo apt install openjdk-11-jdk, and verify with java -version.
- Passwordless SSH for the Hadoop user (e.g., hadoop) to enable cluster communication. Generate keys with ssh-keygen -t rsa and append the public key to ~/.ssh/authorized_keys (for example, cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys), then verify with ssh localhost.
1. Install and Configure Hadoop
Download Hadoop (e.g., 3.3.6) from the Apache website and extract it to /opt:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzvf hadoop-3.3.6.tar.gz -C /opt
ln -s /opt/hadoop-3.3.6 /opt/hadoop # Create a symbolic link for easy access
Set environment variables in /etc/profile (as root):
echo "export HADOOP_HOME=/opt/hadoop" >> /etc/profile
echo "export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin" >> /etc/profile
source /etc/profile
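After reloading the profile, a quick sanity check that the Hadoop binaries are on the PATH:
hadoop version # should print the 3.3.6 build information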
Configure the core Hadoop files in $HADOOP_HOME/etc/hadoop. In core-site.xml, set the default filesystem and a temporary directory:
<configuration>
<property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>
<property><name>hadoop.tmp.dir</name><value>/opt/hadoop/tmp</value></property>
</configuration>
In hdfs-site.xml, set the replication factor and the NameNode/DataNode storage directories:
<configuration>
<property><name>dfs.replication</name><value>1</value></property>
<property><name>dfs.namenode.name.dir</name><value>/opt/hadoop/hdfs/namenode</value></property>
<property><name>dfs.datanode.data.dir</name><value>/opt/hadoop/hdfs/datanode</value></property>
</configuration>
In mapred-site.xml, run MapReduce on YARN:
<configuration>
<property><name>mapreduce.framework.name</name><value>yarn</value></property>
</configuration>
In yarn-site.xml, enable the MapReduce shuffle service and set the ResourceManager host:
<configuration>
<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
<property><name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
<property><name>yarn.resourcemanager.hostname</name><value>localhost</value></property>
</configuration>
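Hadoop's start scripts also need JAVA_HOME set in $HADOOP_HOME/etc/hadoop/hadoop-env.sh. A minimal sketch, assuming the default Debian OpenJDK 11 path (adjust to your installation):
echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh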
Format HDFS (only once) and start services:
hdfs namenode -format
start-dfs.sh # Start HDFS
start-yarn.sh # Start YARN
Verify with hdfs dfsadmin -report (check DataNodes) and yarn node -list (check NodeManagers).
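As a further end-to-end check, create your HDFS home directory and round-trip a small file (the paths below are illustrative and assume you are running as the hadoop user):
hdfs dfs -mkdir -p /user/hadoop
echo "hello hadoop spark" > test.txt
hdfs dfs -put test.txt /user/hadoop/
hdfs dfs -cat /user/hadoop/test.txt
The NameNode web UI at http://localhost:9870 and the YARN ResourceManager UI at http://localhost:8088 should also be reachable.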
2. Install and Configure Spark
Download Spark (e.g., 3.3.2) pre-built for Hadoop (e.g., spark-3.3.2-bin-hadoop3.tgz) and extract it to /opt:
wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar -xzvf spark-3.3.2-bin-hadoop3.tgz -C /opt
ln -s /opt/spark-3.3.2-bin-hadoop3 /opt/spark # Symbolic link
Set environment variables in /etc/profile (as root):
echo "export SPARK_HOME=/opt/spark" >> /etc/profile
echo "export PATH=\$PATH:\$SPARK_HOME/bin:\$SPARK_HOME/sbin" >> /etc/profile
source /etc/profile
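As with Hadoop, a quick check that Spark is on the PATH:
spark-submit --version # should report Spark 3.3.2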
Configure Spark to integrate with Hadoop:
In $SPARK_HOME/conf/spark-env.sh, point Spark at Hadoop's configuration directory (HADOOP_CONF_DIR) and add Hadoop's libraries to Spark's classpath (SPARK_DIST_CLASSPATH):
echo "export HADOOP_CONF_DIR=\$HADOOP_HOME/etc/hadoop" >> \$SPARK_HOME/conf/spark-env.sh
echo "export SPARK_DIST_CLASSPATH=\$(\$HADOOP_HOME/bin/hadoop classpath)" >> \$SPARK_HOME/conf/spark-env.sh
In $SPARK_HOME/conf/spark-defaults.conf, use YARN as the default master and enable event logging to HDFS:
spark.master yarn
spark.hadoop.fs.defaultFS hdfs://localhost:9000
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/spark-logs
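Spark will not create the event log directory itself, so create it in HDFS before submitting jobs:
hdfs dfs -mkdir -p /spark-logs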
Optionally, start Spark's standalone master and worker (not required when submitting to YARN, but useful for standalone mode):
start-master.sh # Start Spark Master (accessible at http://localhost:8080)
start-worker.sh spark://localhost:7077 # Start Spark Worker (the script was named start-slave.sh before Spark 3.1)
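At this point jps (shipped with the JDK) should list the running daemons, roughly:
jps
# Expect (PIDs omitted): NameNode, DataNode, SecondaryNameNode,
# ResourceManager, NodeManager, Master, Worker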
3. Integrate Hadoop and Spark
The key to integration is ensuring Spark can access Hadoop’s resources (HDFS, YARN). The above configurations achieve this by:
- Pointing Spark at Hadoop's configuration directory (HADOOP_CONF_DIR).
- Adding Hadoop's libraries to Spark's classpath (SPARK_DIST_CLASSPATH).
- Using YARN as the resource manager (spark.master yarn) and HDFS as the default filesystem (spark.hadoop.fs.defaultFS).
To validate the integration, run a Spark job that reads/writes data from HDFS:
# Smoke test: run the bundled SparkPi example
/opt/spark/bin/run-example SparkPi 10
# Submit the same example to YARN explicitly
/opt/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
/opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar 10
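To exercise HDFS as well, the bundled JavaWordCount example can read the file uploaded during the earlier smoke test (the path /user/hadoop/test.txt is illustrative; use whatever file you put in HDFS):
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.JavaWordCount \
  --master yarn \
  --deploy-mode client \
  /opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
  hdfs://localhost:9000/user/hadoop/test.txt
The job runs on YARN, reads the file from HDFS, and prints per-word counts to the console.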
Troubleshooting Tips
- Check the logs in $HADOOP_HOME/logs or $SPARK_HOME/logs.
- Make sure the Spark build matches your Hadoop version (e.g., spark-3.3.2-bin-hadoop3 for Hadoop 3.x).
- Ensure HDFS directories (e.g., /user/hadoop) exist and have the correct permissions.
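For failures on the YARN side, listing applications and pulling their logs is often the fastest route (the application ID is a placeholder; yarn logs requires log aggregation to be enabled):
yarn application -list
yarn logs -applicationId <application_id>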