Configuring a reliable network is critical for a Hadoop cluster to ensure seamless communication between nodes (NameNode, DataNodes, ResourceManager, NodeManagers). Below is a structured guide tailored for Ubuntu systems, covering static IP setup, hostname configuration, hosts file modification, SSH setup, and essential Hadoop network configurations.
Hadoop requires stable IP addresses for cluster nodes. Use Netplan (Ubuntu’s default network configuration tool) to configure static IPs.
1. Run ip a to list all network interfaces (e.g., ens33).
2. Open /etc/netplan/01-netcfg.yaml (the filename may vary) in a text editor (e.g., sudo nano /etc/netplan/01-netcfg.yaml).
3. Replace ens33, 192.168.1.100, 255.255.255.0, 192.168.1.1, and 8.8.8.8 below with your interface name, desired IP address, subnet mask, gateway, and DNS server, respectively:

network:
  version: 2
  renderer: networkd
  ethernets:
    ens33:
      dhcp4: no
      addresses: [192.168.1.100/24]
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
4. Run sudo netplan apply to activate the new configuration.
5. Run ip a again to confirm the static IP is assigned. (A per-node variation of the Netplan file is sketched below.)
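Every node needs its own static address. For reference, only the address line changes from node to node; a minimal sketch for a worker node, assuming datanode1 uses 192.168.1.101 (matching the hosts file entries later in this guide):

network:
  version: 2
  renderer: networkd
  ethernets:
    ens33:
      dhcp4: no
      addresses: [192.168.1.101/24]   # datanode1; give each node a unique address
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]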
Each node in the cluster should have a unique hostname (e.g., namenode, datanode1). This helps identify nodes in logs and commands.

1. Run sudo hostnamectl set-hostname <your_hostname> (e.g., sudo hostnamectl set-hostname namenode).
2. Edit /etc/hostname and replace the existing content with <your_hostname>.
3. Run sudo reboot to apply the hostname change. (A quick post-reboot check is sketched below.)
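After the reboot, a minimal check that the new name took effect (both are standard system commands):

hostnamectl   # "Static hostname" should show the name you set
hostname      # prints the short hostname the Hadoop daemons will report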
The /etc/hosts file maps IP addresses to hostnames, enabling name-based communication between nodes (and avoiding reliance on DNS).

1. Open /etc/hosts in a text editor (e.g., sudo nano /etc/hosts).
2. Add an entry for every node in the cluster, for example:

192.168.1.100 namenode
192.168.1.101 datanode1
192.168.1.102 datanode2
3. Copy the updated file to every other node with scp (e.g., scp /etc/hosts user@datanode1:/etc/hosts). Note that writing directly to /etc/hosts requires root privileges on the remote node, so you may need to copy to a temporary location first, as in the sketch below.
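A minimal sketch of that copy step for all worker nodes, assuming the hostnames above and a user account named hadoop with sudo rights on each node (you will be prompted for passwords until SSH keys are set up in the next step):

# Push the updated hosts file to every worker node
for node in datanode1 datanode2; do
  scp /etc/hosts hadoop@"$node":/tmp/hosts
  ssh -t hadoop@"$node" "sudo mv /tmp/hosts /etc/hosts"
done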
Hadoop requires secure, passwordless communication between nodes (e.g., for the NameNode to manage DataNodes). Use SSH keys to achieve this.

1. Generate a key pair with ssh-keygen -t rsa. Press Enter to accept the default paths and skip the passphrase (for automation).
2. Run ssh-copy-id user@remote_node_ip (e.g., ssh-copy-id user@datanode1) to add the public key to the ~/.ssh/authorized_keys file of each remote node. (A loop covering all nodes at once is sketched below.)
3. Test with ssh user@remote_node_ip (e.g., ssh user@datanode1). You should log in without entering a password.
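A minimal sketch that distributes the public key to every node in one pass, assuming the hostnames above and the same user account name on each node (you will be prompted once per node for its password):

for node in namenode datanode1 datanode2; do
  ssh-copy-id "$USER@$node"
done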
Hadoop's configuration files (typically under $HADOOP_HOME/etc/hadoop) define how it interacts with the network. The key files are core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.

In core-site.xml, set the default filesystem URI so every node can reach the NameNode:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value> <!-- Replace "namenode" with the NameNode's hostname -->
  </property>
</configuration>
In hdfs-site.xml, set the replication factor and the local storage directories:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- Number of replicas for each data block -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/path/to/namenode/dir</value> <!-- Local directory for NameNode metadata -->
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/path/to/datanode/dir</value> <!-- Local directory for DataNode data storage -->
  </property>
</configuration>
In mapred-site.xml, tell MapReduce to run on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
In yarn-site.xml, point the NodeManagers at the ResourceManager and set per-node resources:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager</value> <!-- Replace with the ResourceManager's hostname -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value> <!-- Memory allocated to each NodeManager (in MB) -->
  </property>
</configuration>
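These files generally need to be identical on every node. A minimal sketch for syncing them, assuming Hadoop is installed at /usr/local/hadoop (as in the environment variables below), the hostnames defined earlier, and that your user account owns the Hadoop directory on each node:

# Copy the four network-related config files to every worker node
for node in datanode1 datanode2; do
  scp /usr/local/hadoop/etc/hadoop/{core-site.xml,hdfs-site.xml,mapred-site.xml,yarn-site.xml} \
      "$USER@$node":/usr/local/hadoop/etc/hadoop/
done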
Add the following environment variables to ~/.bashrc (or /etc/profile for system-wide access):

export HADOOP_HOME=/usr/local/hadoop   # Replace with your Hadoop installation path
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # Replace with your Java path
Apply the changes with source ~/.bashrc. (A quick sanity check is sketched below.)
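A minimal sanity check that the paths resolve, assuming Hadoop and Java are installed at the locations above:

source ~/.bashrc
hadoop version    # should print the Hadoop release and build information
java -version     # should print the installed JDK version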
Firewalls and SELinux can block Hadoop's network ports (e.g., 9000 for HDFS, 8088 for YARN). On a trusted, isolated cluster network, the simplest option is to disable them. Stop and disable Ubuntu's ufw firewall:

sudo systemctl stop ufw
sudo systemctl disable ufw
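If you prefer to keep the firewall enabled, a minimal sketch that opens only the ports mentioned above instead (your Hadoop version may use additional DataNode/NodeManager ports, so adjust as needed):

sudo ufw allow 9000/tcp   # HDFS NameNode RPC (fs.defaultFS)
sudo ufw allow 8088/tcp   # YARN ResourceManager web UI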
If a node runs SELinux (more common on RHEL/CentOS than on Ubuntu, which uses AppArmor), edit /etc/selinux/config, set SELINUX=disabled, and reboot the system to apply the change. After completing the above steps, verify that all nodes can communicate and that the Hadoop services start correctly.
1. Run ping datanode1 (replace with a DataNode's hostname or IP). Ensure there is no packet loss.
2. Run hdfs dfsadmin -report on the NameNode. You should see all DataNodes listed.
3. Run yarn node -list on the ResourceManager. You should see all NodeManagers listed.

By following these steps, you'll establish a robust network foundation for your Ubuntu-based Hadoop cluster, ensuring reliable communication between nodes and a solid base for distributed data processing.