Hadoop Virtual Cluster Installation

Data scientists and others getting into big data analytics come from a wide range of backgrounds, and many lack the computer networking fundamentals needed to install a Hadoop cluster. For most data science profiles, such as designing data research experiments or building data systems, this would not make a difference: a Data Engineer with a good background in databases and network engineering would typically handle Hadoop interactions. However, employers expect data scientists to know how to deal with large volumes of data, and with Hadoop being synonymous with big data analysis, a working knowledge of a Hadoop cluster comes in handy.

To experiment with Hadoop, most major distributors, such as Cloudera and Hortonworks, provide free sandbox environments. These can be installed as a single virtual machine on virtualization software such as VMware or VirtualBox on a desktop. A sandbox is essentially a pseudo-cluster that gives the feel of a cluster for development purposes, and installation is simple.

There are two basic steps to install the ‘sandbox’:

  1. Install Virtualization Software  (VMware/VirtualBox)
  2. Download and install the Sandbox from the distributor (Cloudera/Hortonworks/MapR)

One such post demonstrating these steps – http://www.hadoopwizard.com/install-hadoop-on-windows-in-3-easy-steps-for-hortonworks-sandbox-tutorial/


If, however, one feels the need to go a bit further and install an entire virtual cluster, things can get difficult without network engineering basics.

There are many guides and posts on this; however, a large part of them do not explain the basic requirement of a cluster, which is getting a bunch of computers to talk to each other. This is actually not that difficult once a person understands the basics, but this post does not get into those details. This is a simple, step-by-step, do-as-I-say post on how one can install a virtual Hadoop cluster in a few hours. I understand that it is quite loaded, and still not as simplified as I’d like it to be, but it should work if one follows each instruction carefully.



Virtualization Software: VirtualBox

OS: CentOS 7.2 64-bit .iso

Network: Internet access


The following six steps complete the process of installing a virtual Hadoop cluster on a host machine:

  1. Install Virtualization Software (VirtualBox is used in this demo)
  2. Install Guest OS (CentOS 7.2 64 bit)
  3. Create and configure a base VM
  4. Clone the base VM into the number of nodes required in the cluster
  5. Reconfigure each node
  6. Install Hadoop (Cloudera Distribution Including Apache Hadoop, with Cloudera Manager)


I. Installation of VirtualBox


This step can be skipped if VirtualBox is already installed on the system.


After installation, the following global network preferences have to be set:

File –> Preferences –> Network –> Host-Only Network

Let the NAT Network settings remain as they are, and make the following changes in the Host-Only Network tab.

a)      Create a new network configuration here

Input IP Address:

Network Mask:


b)     Disable DHCP
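For reference, VirtualBox’s default host-only network makes a reasonable choice here (these exact values are an assumption – any private subnet works, as long as it is used consistently in the steps below):

IP Address:    192.168.56.1
Network Mask:  255.255.255.0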


II. Create a new VM – name it “base”

Leave the standard settings for Linux (Red Hat) 64-bit.

Go to the settings of the created VM:

Settings –> Network


Adapter 1 –> NAT

Adapter 2 –> Host-Only Adapter

Name: select the adapter created earlier

Promiscuous Mode: Deny

Cable Connected: checked


III. Configure VM ‘base’

Power up the new Virtual Machine – ‘base’.

The following configuration files have to be edited:


a)      Check the names of the two Ethernet adapters

>> ip link show


Three adapters should come up. One of them will be lo, which is the loopback interface for the VM and should stay as it is.
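Example output (a sketch – interface names such as eth0/eth1, or enp0s3/enp0s8 on newer systems, vary from machine to machine):

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 ...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...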



The remaining two Ethernet adapters will have to be configured. Basically, one adapter provides internet connectivity to the machine (the NAT adapter), whereas the other links it to the other machines via a static IP address (the host-only adapter).


Please note: the names of the Ethernet adapters, i.e. eth0 / eth1, could be different on your system. The ip link show command will tell you the names to be used.


>> vi /etc/sysconfig/network-scripts/ifcfg-eth0
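A minimal sketch of what the NAT adapter’s file typically contains on CentOS 7 (assuming eth0 is the NAT adapter; adjust the name to whatever ip link show reported):

DEVICE=eth0
TYPE=Ethernet
BOOTPROTO=dhcp          # NAT adapter takes its address from VirtualBox's built-in DHCP
ONBOOT=yes              # bring the interface up at boot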





>> vi /etc/sysconfig/network-scripts/ifcfg-eth1
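And a sketch of the host-only adapter’s file with a static address (assuming eth1 is the host-only adapter and the hypothetical 192.168.56.0/24 network from step I):

DEVICE=eth1
TYPE=Ethernet
BOOTPROTO=none          # static address on the host-only network
IPADDR=192.168.56.101   # example address; each clone gets its own in step V
NETMASK=255.255.255.0
ONBOOT=yes              # bring the interface up at boot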








Restart Network Service

>> service network restart


b)      Network

>> vi /etc/sysconfig/network
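This file is nearly empty on a stock CentOS 7 install; a minimal sketch of what this step typically sets (the hostname is a placeholder, overwritten per node in step V):

NETWORKING=yes
HOSTNAME=base.example.com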





c)       Hostname

>> vi /etc/hostname
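/etc/hostname holds just the machine’s name on a single line; for the base VM any placeholder works, since each clone gets its own name in step V:

base.example.com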



d)      Disable SELinux

>> vi /etc/selinux/config
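Set the SELINUX line as follows (a standard CentOS setting; it takes effect after a reboot):

SELINUX=disabled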



e)      Firewall

Check Firewall Status

>> systemctl status firewalld

Disable Firewall

>> systemctl disable firewalld

>> systemctl stop firewalld


f)       Fastest Mirror

>> vi /etc/yum/pluginconf.d/fastestmirror.conf
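The plugin is controlled by a single flag; disabling it skips the mirror-speed probe, which can slow yum down on some networks:

[main]
enabled=1     # set to 0 to disable the fastest-mirror check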




g)      Set up the nodes’ FQDNs and IP addresses in /etc/hosts

>> vi /etc/hosts

Add one line per node, mapping each node’s static IP to its FQDN and short name ([IP of node x] stands for the address chosen on your host-only network):

[IP of node 1]   hadoop1.example.com   hadoop1
[IP of node 2]   hadoop2.example.com   hadoop2
[IP of node 3]   hadoop3.example.com   hadoop3
[IP of node 4]   hadoop4.example.com   hadoop4


h)      Set up SSH

>> ssh-keygen   (press Enter at each prompt to accept the defaults)

>> cd ~/.ssh

>> cp id_rsa.pub authorized_keys

Modify the SSH client config – add the following line to /etc/ssh/ssh_config:

StrictHostKeyChecking no
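Because every node is cloned from this VM, all nodes end up sharing the same key pair and authorized_keys file, which is what makes passwordless SSH between them work. If you prefer not to change the system-wide config, the same setting can be scoped to just the cluster hosts in ~/.ssh/config (a hypothetical variant):

Host hadoop*
    StrictHostKeyChecking no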


IV. Clone VM


Clone the created VM as many times as required.

Make sure MAC addresses are reinitialized for each node.
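Cloning can also be scripted from the host with VBoxManage (a sketch; the node names are the hypothetical ones used throughout this post). By default clonevm generates fresh MAC addresses, which is exactly what the cluster needs:

>> VBoxManage clonevm base --name hadoop1 --register
>> VBoxManage clonevm base --name hadoop2 --register
>> VBoxManage clonevm base --name hadoop3 --register
>> VBoxManage clonevm base --name hadoop4 --register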

V. Configure each node

The following configuration should be performed on each node or VM created.

a)      Hostname

>> vi /etc/hostname

hadoop[x].example.com   {replace [x] with the node’s number}


b)      Network

In /etc/sysconfig/network, update the hostname line:

HOSTNAME=hadoop[x].example.com


c)       Modify the static IP address

The Ethernet adapter carrying the static IP address (the host-only adapter – eth1 in this example) has to be given a new, unique address on each clone.

>> vi /etc/sysconfig/network-scripts/ifcfg-eth1

IPADDR=[x]   {where [x] is to be replaced with the address chosen for this node}
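For example, under the hypothetical 192.168.56.0/24 host-only network used in the sketches above, the nodes could simply be numbered consecutively (one line per node, each in that node’s own ifcfg file):

IPADDR=192.168.56.101   # hadoop1
IPADDR=192.168.56.102   # hadoop2
IPADDR=192.168.56.103   # hadoop3
IPADDR=192.168.56.104   # hadoop4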


>> service network restart


d)      TESTING

>> ping google.com

Ping google.com from each node.

Also ping every other host, by both IP address and hostname, and SSH to each node.

>> ping hadoop2

>> ping hadoop3

>> ssh hadoop2

>> ssh hadoop3
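A quick way to run these checks in one pass (a sketch; assumes the four hostnames from the /etc/hosts entries in step III):

>> for h in hadoop1 hadoop2 hadoop3 hadoop4; do ping -c 1 $h && ssh $h hostname; done

Each ssh call should print the remote node’s hostname without prompting for a password.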


e)      Reboot each node



VI. Install Hadoop Distribution

Install Hadoop using Cloudera Manager or Apache Ambari on the master node. There are many online guides on this – via Cloudera for Cloudera Manager, or Hortonworks for Ambari.

For Cloudera Manager, I suggest this post – http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/
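As a rough sketch of the Cloudera Manager route on the master node (the installer URL is the one Cloudera published for CM 5 at the time of writing and may have moved since):

>> wget https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
>> chmod u+x cloudera-manager-installer.bin
>> ./cloudera-manager-installer.bin

Once the installer finishes, the Cloudera Manager web UI comes up on port 7180 of the master node, from where the rest of the cluster is deployed.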




Additional Testing – If needed!  

(Although if you’re not someone with a networking background, HOPE that these issues do not come up!)

  • Getting Ports to listen!

Check which ports are listening

nmap -sT -O localhost


Check ports running TCP

yum install net-tools     {if netstat is not installed on the Linux distro}

netstat -t


Check specific port number

netstat -anp | grep 12345


Opening ports with Firewall

systemctl start firewalld

firewall-cmd --add-port=7182/tcp --permanent

firewall-cmd --add-port=9000/tcp --permanent

firewall-cmd --add-port=9001/tcp --permanent

firewall-cmd --reload


Use the iptables method

Disable firewalld again

systemctl stop firewalld


Add an iptables port rule

iptables -A INPUT -m state --state NEW -p tcp --dport 8080 -j ACCEPT

and restart iptables with /etc/init.d/iptables restart




