Hadoop Virtual Cluster Installation

Data scientists and others getting into big data analytics come from a wide range of backgrounds, and often lack the computer networking fundamentals needed to install a Hadoop cluster. For most data science profiles, such as those designing data research experiments or building data systems, this makes little difference: a Data Engineer with a good background in databases and network engineering would typically handle the Hadoop side. However, employers expect data scientists to know how to deal with large volumes of data, and with Hadoop being synonymous with big data analysis, a working knowledge of a Hadoop cluster comes in handy.

In order to experiment with Hadoop, there are sandbox environments, free of cost, provided by most major distributors, such as Cloudera or Hortonworks. These can be installed as a single virtual machine on virtualization software such as VMware or VirtualBox on a desktop. A sandbox is essentially a pseudo-cluster that gives the feel of a cluster for development purposes, and installation is simple.

There are two basic steps to install the ‘sandbox’:

  1. Install Virtualization Software  (VMware/VirtualBox)
  2. Download and install the Sandbox from a distributor (Cloudera/Hortonworks/MapR)

One such post demonstrating these steps – http://www.hadoopwizard.com/install-hadoop-on-windows-in-3-easy-steps-for-hortonworks-sandbox-tutorial/

 

If, however, one feels the need to go a bit further and install an entire virtual cluster, things can get difficult without network engineering basics.

There are many guides and posts on this; however, a large part of them do not explain the basic requirement of a cluster, which is getting a bunch of computers to talk to each other. This is actually not that difficult once a person understands the basics, but this post does not get into those details. This is a simple, step-by-step, do-as-I-say post on how one can install a virtual Hadoop cluster in a few hours. I understand that this is quite loaded, and still not as simplified as I’d like it to be, but it should work if one follows each instruction carefully.

 

Requirements

Virtualization Software: VirtualBox

OS: Centos 7.2 64 bit .iso

Network: Internet access

 

The following six steps complete the process of installing a virtual Hadoop cluster on a host machine:

  1. Install virtualization software (VirtualBox is used in this demo)
  2. Install the guest OS (CentOS 7.2 64-bit)
  3. Create and configure a base VM
  4. Clone the base VM to the number of nodes required in the cluster
  5. Reconfigure each node
  6. Install Hadoop (Cloudera Distribution Including Apache Hadoop, with Cloudera Manager)

 

I. Installation of VirtualBox

 

This step can be skipped if VirtualBox is already installed on the system.

 

Following installation, the following global network preferences have to be set:

File -> Preferences -> Network -> Host-Only Networks

Leave the NAT Networks settings as they are, and make the following changes in the Host-Only Networks tab.

a) Create a new network configuration here

IPv4 Address: 10.0.1.1

Network Mask: 255.255.255.0

b) Disable the DHCP server
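
For those who prefer the command line, the same host-only network can be created with VBoxManage, run on the host. This is a sketch: vboxnet0 is the usual default interface name on Linux/macOS hosts, while on Windows the first command prints a longer name such as “VirtualBox Host-Only Ethernet Adapter”, so substitute whatever it reports.

>> VBoxManage hostonlyif create                                          # prints the new interface name
>> VBoxManage hostonlyif ipconfig vboxnet0 --ip 10.0.1.1 --netmask 255.255.255.0
>> VBoxManage dhcpserver modify --ifname vboxnet0 --disable              # matches step (b)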

 

II. Create a new VM – name it “base”

Leave the standard settings for Linux (Red Hat, 64-bit).

Go to the settings of the created VM:

Settings -> Network

 

Adapter 1 -> NAT

Adapter 2 -> Host-Only Adapter

Name: select the host-only adapter created earlier

Promiscuous Mode: Deny

Cable Connected: checked
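
The same adapter setup can also be scripted with VBoxManage once the VM exists (a sketch; “base” and vboxnet0 are the names used in this walkthrough):

>> VBoxManage modifyvm "base" --nic1 nat
>> VBoxManage modifyvm "base" --nic2 hostonly --hostonlyadapter2 vboxnet0
>> VBoxManage modifyvm "base" --nicpromisc2 deny --cableconnected2 on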

 

III. Configure VM ‘base’

Power up the new virtual machine, ‘base’.

The following configuration files have to be edited:

 

a) Check the names of the two Ethernet adapters

>> ip link show

Three adapters should come up. One of them will be lo, the loopback interface for the VM, which stays as it is.

 

 

The remaining two Ethernet adapters have to be configured. Basically, one adapter provides internet connectivity to the machine, whereas the other links the node to the other machines via a static IP address.

 

Please note: the names of the Ethernet adapters, i.e. eth0/eth1, could be different on your system. The ip link show command will tell you the names to use.
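
For example, on a stock CentOS 7 guest under VirtualBox the output often looks roughly like this, with the predictable names enp0s3 and enp0s8 instead of eth0/eth1:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 ...
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...

If you see such names, use them in place of eth0/eth1 in the ifcfg file names and DEVICE= lines below.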

 

>> vi /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE=eth0

BOOTPROTO=dhcp

ONBOOT=yes

 

>> vi /etc/sysconfig/network-scripts/ifcfg-eth1

DEVICE=eth1

IPADDR=10.0.1.11

NETMASK=255.255.255.0

BOOTPROTO=static

ONBOOT=yes

(The static address must sit on the 10.0.1.x host-only subnet created earlier; this one is temporary and is replaced with a per-node address in step V.)

 

 

Restart Network Service

>> service network restart
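
A quick sanity check that the static address took effect: eth1 should now show the 10.0.1.x address, and the host’s host-only interface (10.0.1.1) should answer pings.

>> ip addr show eth1
>> ping -c 3 10.0.1.1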

 

b) Network

>> vi /etc/sysconfig/network

NETWORKING=yes

HOSTNAME=base.example.com

GATEWAY=10.0.1.1

 

c) Hostname

>> vi /etc/hostname

base.example.com
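
On CentOS 7 the same can be done in one step, and applied immediately, with hostnamectl:

>> hostnamectl set-hostname base.example.com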

 

d) Disable SELinux

>>vi /etc/selinux/config

SELINUX=disabled
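
The config file change only takes effect after a reboot; to relax SELinux right away for the current session, switch it to permissive mode:

>> setenforce 0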

 

e) Firewall

Check firewall status

>> systemctl status firewalld

Stop and disable the firewall

>> systemctl stop firewalld

>> systemctl disable firewalld

 

f) Disable the yum fastestmirror plugin

>> vi /etc/yum/pluginconf.d/fastestmirror.conf

enabled=0

 

 

g) Set up node FQDNs and IP addresses

Add an entry for every node of the cluster to the hosts file:

>> vi /etc/hosts

10.0.1.201 hadoop1.example.com hadoop1

10.0.1.202 hadoop2.example.com hadoop2

10.0.1.203 hadoop3.example.com hadoop3

10.0.1.204 hadoop4.example.com hadoop4

 

h) Set up SSH

>> ssh-keygen   (press Enter at each prompt)

>> cd ~/.ssh

>> cp id_rsa.pub authorized_keys

Modify the SSH client config (/etc/ssh/ssh_config, or ~/.ssh/config) so that first-time connections are not held up by host-key prompts:

StrictHostKeyChecking no
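
sshd is strict about permissions on these files, and clones inherit whatever the base VM had; if key-based login fails later, tightening them usually fixes it:

>> chmod 700 ~/.ssh
>> chmod 600 ~/.ssh/authorized_keys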

 

IV. Clone VM

 

Clone the created VM the number of times required.

Make sure MAC addresses are reinitialized for each node.
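
A command-line sketch of the same for a four-node cluster (VBoxManage clonevm generates fresh MAC addresses by default, which is exactly what we need here):

>> for i in 1 2 3 4; do VBoxManage clonevm base --name hadoop$i --register; done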

V. Configure each node

The following configuration should be performed on each node or VM created.

a) Hostname

>> vi /etc/hostname

hadoop[x].example.com   {replace [x] with the node number}

 

b) Network

>> vi /etc/sysconfig/network

HOSTNAME=hadoop[x].example.com

 

c) Modify the static IP address

The Ethernet adapter holding the static IP address has to be updated with a new address:

>> vi /etc/sysconfig/network-scripts/ifcfg-eth1

IPADDR=10.0.1.20[x]   {where [x] is replaced with the node number}

 

>> service network restart
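
As a concrete example, all of step V for node 1 boils down to three edits, scriptable as follows (a sketch using sed against the file names above):

>> hostnamectl set-hostname hadoop1.example.com
>> sed -i 's/^HOSTNAME=.*/HOSTNAME=hadoop1.example.com/' /etc/sysconfig/network
>> sed -i 's/^IPADDR=.*/IPADDR=10.0.1.201/' /etc/sysconfig/network-scripts/ifcfg-eth1
>> service network restart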

 

d) Testing

>> ping google.com

Ping google.com from each node to confirm internet access.

Also ping every other host, both by IP address and by hostname, and SSH to each node (a one-shot loop for all nodes follows the examples below).

>> ping hadoop2

>> ping hadoop3

>> ssh hadoop2

>> ssh hadoop3
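
To test everything in one go from any node, a small loop over the hostnames works (it expects the /etc/hosts entries and SSH keys set up earlier):

>> for h in hadoop1 hadoop2 hadoop3 hadoop4; do ping -c 1 $h && ssh $h hostname; done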

 

e) Reboot each node

>>reboot

 

VI. Install Hadoop Distribution

Install Hadoop using Cloudera Manager or Apache Ambari from the master node. There are many online guides for this: Cloudera’s for Cloudera Manager, Hortonworks’ for Ambari.

For Cloudera Manager, I suggest this post: http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/
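
For reference, at the time of writing the Cloudera Manager 5 installer was fetched and run on the master node roughly like this (the archive URL may have moved or may now require Cloudera credentials):

>> wget https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
>> chmod u+x cloudera-manager-installer.bin
>> sudo ./cloudera-manager-installer.bin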

 

 

 

Additional Testing – If needed!  

(Although if you’re not someone with a networking background, HOPE that these issues do not come up!)

  • Getting Ports to listen!

Check which ports are listening

>> nmap -sT -O localhost

 

Check listening TCP ports

>> yum install net-tools   (if netstat is not installed on your distro)

>> netstat -tln

 

Check specific port number

>> netstat -anp | grep 12345
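
On CentOS 7 the ss utility ships by default and does the same job without installing net-tools:

>> ss -tlnp
>> ss -tlnp | grep 12345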

 

Opening ports with the firewall

>> systemctl start firewalld

>> firewall-cmd --add-port=7182/tcp --permanent

>> firewall-cmd --add-port=9000/tcp --permanent

>> firewall-cmd --add-port=9001/tcp --permanent

>> firewall-cmd --reload
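
To verify the rules took effect:

>> firewall-cmd --list-ports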

 

Using the iptables method instead

Disable firewalld again

>> systemctl stop firewalld

Add an iptables rule for the port

>> iptables -A INPUT -m state --state NEW -p tcp --dport 8080 -j ACCEPT

and restart iptables with service iptables restart (on CentOS 7 this needs the iptables-services package).

 

 

 
