Prerequisites
Before installing HDFS on Debian, ensure your system is up-to-date and install essential tools:
sudo apt update && sudo apt upgrade -y
sudo apt install wget ssh vim -y
These commands update package lists, upgrade installed packages, and install wget (for downloading Hadoop), ssh (the OpenSSH client and server, which Hadoop's start scripts use), and vim (for editing configuration files).
1. Install Java Environment
Hadoop 3.3.x runs on Java 8 or Java 11. Install OpenJDK 11 (recommended for compatibility):
sudo apt install openjdk-11-jdk -y
Verify the installation:
java -version
You should see output indicating OpenJDK 11 is installed.
2. Create a Dedicated Hadoop User
For security and isolation, create a non-root user (e.g., hadoop) and add it to the sudo group:
sudo adduser hadoop
sudo usermod -aG sudo hadoop
Switch to the new user:
su - hadoop
This user will manage all Hadoop operations.
3. Download and Extract Hadoop
Download a stable Hadoop release (3.3.6 is used throughout this guide) from the Apache download site:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
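Optionally, verify the download against the checksum published alongside the archive. The .sha512 URL below assumes the same mirror layout as the tarball; if the checksum file is not in plain sha512sum format, compare the hash by eye instead:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum -c hadoop-3.3.6.tar.gz.sha512  # should report: hadoop-3.3.6.tar.gz: OK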
Extract the archive to /usr/local/ and rename the directory for simplicity:
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /usr/local/
sudo mv /usr/local/hadoop-3.3.6 /usr/local/hadoop
Change ownership of the Hadoop directory to the hadoop user:
sudo chown -R hadoop:hadoop /usr/local/hadoop
4. Configure Environment Variables
Set up Hadoop-specific environment variables in /etc/profile (system-wide) or ~/.bashrc (user-specific). Open the file with vim:
vim ~/.bashrc
Add the following lines at the end:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64  # Adjust if using a different Java version
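If you are unsure of the exact JDK path on your system, one way to find it (assuming Java was installed through apt as above) is:
readlink -f /usr/bin/java | sed 's|/bin/java||'  # prints the directory to use as JAVA_HOME, e.g. /usr/lib/jvm/java-11-openjdk-amd64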
Load the changes into the current session:
source ~/.bashrc
Verify the variables are set:
echo $HADOOP_HOME  # Should output /usr/local/hadoop
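You can also confirm that the Hadoop binaries are now on your PATH:
hadoop version  # should report Hadoop 3.3.6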
5. Configure SSH Passwordless Login
Hadoop requires passwordless SSH between the NameNode and DataNodes. Generate an SSH key pair:
ssh-keygen -t rsa -b 4096 -C "hadoop@debian"
Press Enter to accept the default file location and leave the passphrase empty. For a single-node cluster, append the public key to your own authorized_keys (a multi-node example follows after the login test below):
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Test passwordless login:
ssh localhost
The first connection may ask you to confirm the host key; after that, you should log in without being prompted for a password.
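For a multi-node cluster, distribute the same public key to every worker node instead; the hostname below is a placeholder for an actual DataNode host:
ssh-copy-id hadoop@datanode1  # repeat for each node in the cluster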
6. Configure Hadoop Core Files
Navigate to the Hadoop configuration directory:
cd $HADOOP_HOME/etc/hadoop
Edit the following files to define HDFS behavior:
core-site.xml: Sets the default file system (HDFS) and NameNode address.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>  <!-- For a multi-node cluster, replace 'localhost' with the NameNode's hostname or IP -->
    </property>
</configuration>
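Optionally, you can also set hadoop.tmp.dir inside the same <configuration> block; its default location is under /tmp, which may be cleared on reboot. The path below is only an example; create it and change its ownership just as in step 7:
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/tmp</value>
    </property>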
hdfs-site.xml: Configures replication factor (for fault tolerance) and data directories.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>  <!-- Set to 3 for multi-node clusters; 1 for single-node -->
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/hdfs/namenode</value>  <!-- Create this directory later -->
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/hdfs/datanode</value>  <!-- Create this directory later -->
    </property>
</configuration>
mapred-site.xml: Specifies the MapReduce framework (YARN).
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml: Configures YARN resource management.
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
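In addition to the XML files, set JAVA_HOME explicitly in hadoop-env.sh: the daemons are launched over SSH by the start scripts and do not inherit your shell environment. Open the file and add the following line (adjust the path for your JDK):
vim hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64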
7. Create HDFS Data Directories
Create the directories specified in hdfs-site.xml for NameNode and DataNode storage:
sudo mkdir -p /opt/hadoop/hdfs/namenode
sudo mkdir -p /opt/hadoop/hdfs/datanode
sudo chown -R hadoop:hadoop /opt/hadoop  # Change ownership to the hadoop user
8. Format the NameNode
The NameNode must be formatted once before starting HDFS. Run this command carefully (it will erase existing HDFS data):
hdfs namenode -format
Near the end of the output you should see a message that the storage directory has been successfully formatted.
9. Start HDFS Services
Start the HDFS daemons (NameNode and DataNode) using the start-dfs.sh script:
$HADOOP_HOME/sbin/start-dfs.sh
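Since mapred-site.xml and yarn-site.xml were configured in step 6, you can optionally start the YARN daemons as well:
$HADOOP_HOME/sbin/start-yarn.sh  # starts the ResourceManager and NodeManager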
Check the status of HDFS processes with jps:
jps
You should see NameNode, DataNode, and SecondaryNameNode running (along with Jps itself and any other Java processes).
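A typical single-node listing looks roughly like this (process IDs will differ; ResourceManager and NodeManager appear only if you started YARN):
12101 NameNode
12235 DataNode
12450 SecondaryNameNode
13021 Jps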
10. Verify HDFS Installation
Use HDFS commands to confirm the cluster is operational:
hdfs dfs -ls /
hdfs dfs -mkdir -p /user/hadoop/input
echo "Hello, HDFS!" > test.txt
hdfs dfs -put test.txt /user/hadoop/input/
hdfs dfs -cat /user/hadoop/input/test.txt
You should see the output Hello, HDFS!
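You can also open the NameNode web UI, served on port 9870 by default in Hadoop 3.x, to inspect cluster health and browse the file system:
wget -qO- http://localhost:9870/ | head -n 5  # or visit http://localhost:9870 in a browser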
Troubleshooting Tips
If Hadoop commands fail with a Java-related error, confirm that JAVA_HOME is correctly set in $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
If daemons fail to start with permission errors, re-run chown to ensure the hadoop user owns all Hadoop-related directories (/usr/local/hadoop and /opt/hadoop).
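If any daemon fails to start, check its log under $HADOOP_HOME/logs; the file name below assumes the default log directory and the hadoop user:
tail -n 50 $HADOOP_HOME/logs/hadoop-hadoop-namenode-*.log  # substitute datanode, etc. for other daemons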