How to Set Up Hadoop 2.8.0 (Single Node Cluster) on CentOS

Introduction

Apache Hadoop 2.8.0 is a minor release in the 2.x.y release line, building upon the previous stable release 2.7.3.

According to the release notes, Apache Hadoop 2.8.0 includes the following features and improvements:

  • Common
    • Support async call retry and failover which can be used in async DFS implementation with retry effort.
    • Cross Frame Scripting (XFS) prevention for UIs can be provided through a common servlet filter.
    • S3A improvements: add ability to plug in any AWSCredentialsProvider, support read s3a credentials from Hadoop credential provider API in addition to XML configuration files, support Amazon STS temporary credentials
    • WASB improvements: adding append API support
    • Build enhancements: replace dev-support with wrappers to Yetus, provide a docker based solution to setup a build environment, remove CHANGES.txt and rework the change log and release notes.
    • Add posixGroups support for LDAP groups mapping service.
    • Support integration with Azure Data Lake (ADL) as an alternative Hadoop-compatible file system.
  • HDFS
    • WebHDFS enhancements: integrate CSRF prevention filter in WebHDFS, support OAuth2 in WebHDFS, disallow/allow snapshots via WebHDFS
    • Allow long-running Balancer to log in with keytab
    • Add ReverseXML processor which reconstructs an fsimage from an XML file. This will make it easy to create fsimages for testing, and manually edit fsimages when there is corruption
    • Support nested encryption zones
    • DataNode Lifeline Protocol: an alternative protocol for reporting DataNode liveness. This can prevent the NameNode from incorrectly marking DataNodes as stale or dead in highly overloaded clusters where heartbeat processing is suffering delays.
    • Logging HDFS operation’s caller context into audit logs
    • A new datanode command for evicting writers which is useful when data node decommissioning is blocked by slow writers.
  • YARN
    • NodeManager CPU resource monitoring in Windows.
    • NM shut down more graceful: NM will unregister to RM immediately rather than waiting for the timeout to be LOST (if NM work preserving is not enabled).
    • Add ability to fail a specific AM attempt in the scenario of AM attempt gets stuck.
    • CallerContext support in YARN audit log.
    • ATS versioning support: a new configuration to indicate timeline service version.
  • MAPREDUCE
    • Allow node labels get specified in submitting MR jobs
    • Add a new tool to combine aggregated logs into HAR file

       Reference: hadoop.apache.org

This post walks through installing Hadoop 2.8.0 on the CentOS operating system, including the basic configuration required to start working with Hadoop. The process is broken down into simple steps.


Step 1 – Installing Java

Java is required to run Hadoop on any system, so before installing Hadoop, make sure Java is installed on your system:


$ java -version

java version "1.8.0_121"

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)

Java HotSpot(TM) 64-Bit Server VM (build 24.121-b04, mixed mode)

If Java is not installed on the system, install OpenJDK 8 with the following command. The -devel package is used here because it includes the JDK tools, such as jps, that are needed later in this guide.


$ sudo yum install java-1.8.0-openjdk-devel

After installing Java, set the Java environment variables in /etc/profile.d/java.sh:

export JAVA_HOME=/usr/lib/jvm/java-openjdk

export JAVA_PATH=$JAVA_HOME

export PATH=$PATH:$JAVA_HOME/bin
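As a quick sanity check on the variables above, the following sketch tests whether a candidate JAVA_HOME directory actually contains the java launcher. The function name check_java_home is hypothetical and is not part of Java or CentOS.

```shell
# Hypothetical helper: succeed only if DIR/bin/java exists and is executable.
check_java_home() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is empty" >&2
        return 1
    fi
    if [ ! -x "$dir/bin/java" ]; then
        echo "no executable java under $dir/bin" >&2
        return 1
    fi
    echo "JAVA_HOME looks usable: $dir"
}

# Usage (after sourcing /etc/profile.d/java.sh):
#   check_java_home "$JAVA_HOME"
```

If the check fails, the symlink under /usr/lib/jvm may have a different name on your system, so list that directory and adjust JAVA_HOME accordingly.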

Step 2 – Set up Hadoop user account

It is recommended to create a non-root user account for the Hadoop environment. Run the following commands as root:


$ adduser hadoop

$ passwd hadoop

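Set up key-based SSH login from the hadoop account to itself. The Hadoop start scripts use SSH to launch the daemons, even on a single node.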


$ su - hadoop

$ ssh-keygen -t rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ chmod 0600 ~/.ssh/authorized_keys

Check that key-based login works, then exit the SSH session:


$ ssh localhost

$ exit
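If the login still prompts for a password, file permissions are a common cause. The sketch below checks that the .ssh directory and authorized_keys carry the modes used in this step; check_ssh_perms is a hypothetical helper name, and the code assumes GNU stat as shipped on CentOS.

```shell
# Hypothetical helper: verify the modes of an .ssh directory and its authorized_keys.
# Assumes GNU stat (coreutils) for the -c '%a' format.
check_ssh_perms() {
    ssh_dir="$1"
    keys="$ssh_dir/authorized_keys"
    if [ ! -f "$keys" ]; then
        echo "missing $keys" >&2
        return 1
    fi
    dir_mode=$(stat -c '%a' "$ssh_dir")
    key_mode=$(stat -c '%a' "$keys")
    if [ "$dir_mode" != "700" ]; then
        echo "$ssh_dir has mode $dir_mode, expected 700" >&2
        return 1
    fi
    if [ "$key_mode" != "600" ]; then
        echo "$keys has mode $key_mode, expected 600" >&2
        return 1
    fi
    echo "ssh permissions ok"
}

# Usage:
#   check_ssh_perms "$HOME/.ssh"
# A non-interactive login test (fails instead of prompting for a password):
#   ssh -o BatchMode=yes localhost true
```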

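Step 3 – Download Hadoop binary release

Download the Hadoop 2.8.0 binary tarball. For a different version, refer to http://hadoop.apache.org. The hadoop user cannot write to /usr/local by default, so run these commands as root and then give the hadoop user ownership of the result with chown -R hadoop:hadoop /usr/local/hadoop.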


$ cd /usr/local

$ wget http://apache.claz.org/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz

$ tar xzf hadoop-2.8.0.tar.gz

$ mv hadoop-2.8.0 hadoop
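Before extracting a downloaded tarball, it is worth comparing its checksum with the one published by Apache. The following is a minimal sketch, assuming you have copied the expected SHA-256 digest from the official download page; verify_sha256 is a hypothetical helper name.

```shell
# Hypothetical helper: compare a file's SHA-256 digest with an expected hex string.
verify_sha256() {
    file="$1"
    expected="$2"
    actual=$(sha256sum "$file" | awk '{print $1}')
    # Lower-case the expected value, since published digests are sometimes upper-case.
    expected_lc=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
    if [ "$actual" = "$expected_lc" ]; then
        echo "checksum ok: $file"
    else
        echo "checksum MISMATCH: $file" >&2
        return 1
    fi
}

# Usage (EXPECTED_DIGEST is a placeholder for the published value):
#   verify_sha256 hadoop-2.8.0.tar.gz "$EXPECTED_DIGEST"
```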

Step 4 – Configure Hadoop Pseudo-Distributed Mode

  1. Set Up Environment Variables

Edit the ~/.bashrc file and append the following lines at the end of the file.

export HADOOP_HOME=/usr/local/hadoop

export HADOOP_INSTALL=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply the changes to the current shell session:


$ source ~/.bashrc
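To confirm that the variables took effect in the current shell, a small check such as the one below can help. It is only a sketch; check_hadoop_env is a hypothetical name, and the list of variables simply mirrors the exports shown above.

```shell
# Hypothetical helper: report any of the variables exported in ~/.bashrc that is empty.
check_hadoop_env() {
    missing=0
    for name in HADOOP_HOME HADOOP_INSTALL HADOOP_MAPRED_HOME HADOOP_COMMON_HOME \
                HADOOP_HDFS_HOME YARN_HOME HADOOP_COMMON_LIB_NATIVE_DIR; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_hadoop_env && echo "Hadoop environment looks complete"
```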

Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set JAVA_HOME

# Change the Java home path to match the Java installation on your system

export JAVA_HOME=/usr/lib/jvm/java-openjdk

  2. Edit Configuration Files

Hadoop has several configuration files, which need to be adjusted to suit the requirements of your Hadoop environment.


$ cd $HADOOP_HOME/etc/hadoop

  i) Edit core-site.xml


<configuration>

 <property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>

 </property>

</configuration>

  ii) Edit hdfs-site.xml


<configuration>

 <property>

<name>dfs.replication</name>

<value>1</value>

 </property>

 <property>

<name>dfs.namenode.name.dir</name>

<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>

 </property>

 <property>

<name>dfs.datanode.data.dir</name>

<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>

 </property>

</configuration>
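Hadoop normally creates the storage directories itself when it has permission to do so, but creating them up front as the hadoop user avoids ownership surprises. A minimal sketch, where make_hdfs_dirs is a hypothetical helper and the base path is whatever you used in hdfs-site.xml:

```shell
# Hypothetical helper: create the NameNode and DataNode storage directories under BASE.
make_hdfs_dirs() {
    base="$1"
    mkdir -p "$base/hdfs/namenode" "$base/hdfs/datanode" || return 1
    echo "created $base/hdfs/namenode and $base/hdfs/datanode"
}

# Usage, matching the paths configured above:
#   make_hdfs_dirs /home/hadoop/hadoopdata
```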

iii) Edit mapred-site.xml


$ cp mapred-site.xml.template mapred-site.xml

<configuration>

 <property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

 </property>

</configuration>

  iv) Edit yarn-site.xml


<configuration>

 <property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

 </property>

</configuration>
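A typo in one of these files usually only shows up when a daemon fails to start, so a quick check beforehand can save time. The sketch below uses hypothetical helpers (has_property and check_site_files) to confirm that one expected property is present in each of three files; it does not validate the XML itself.

```shell
# Hypothetical helper: succeed if FILE contains a <name>KEY</name> element.
# -F treats the key literally, so dots in property names are not regex wildcards.
has_property() {
    grep -qF "<name>$2</name>" "$1" 2>/dev/null
}

# Check one expected property per file in the given configuration directory.
check_site_files() {
    conf_dir="$1"
    rc=0
    if ! has_property "$conf_dir/hdfs-site.xml" dfs.replication; then
        echo "hdfs-site.xml: dfs.replication missing" >&2
        rc=1
    fi
    if ! has_property "$conf_dir/mapred-site.xml" mapreduce.framework.name; then
        echo "mapred-site.xml: mapreduce.framework.name missing" >&2
        rc=1
    fi
    if ! has_property "$conf_dir/yarn-site.xml" yarn.nodemanager.aux-services; then
        echo "yarn-site.xml: yarn.nodemanager.aux-services missing" >&2
        rc=1
    fi
    return "$rc"
}

# Usage:
#   check_site_files "$HADOOP_HOME/etc/hadoop"
```

Once Hadoop is on the PATH, hdfs getconf -confKey dfs.replication prints the value Hadoop actually resolves for a key, which is a more direct check.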

  3. Format Hadoop Namenode

Once the Hadoop single node cluster has been configured, initialize the HDFS file system by formatting the NameNode:


$ hdfs namenode -format

Sample output:

17/02/14 08:13:20 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:   host = ip-172-31-10-127.us-west-2.compute.internal/172.31.10.127

STARTUP_MSG:   args = [-format]

STARTUP_MSG:   version = 2.8.0

17/02/14 08:13:30 INFO namenode.FSImage: Allocated new BlockPoolId: BP-415680745-172.31.10.127-1487060010110

17/02/14 08:13:30 INFO common.Storage: Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted.

17/02/14 08:13:30 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0

17/02/14 08:13:30 INFO util.ExitUtil: Exiting with status 0

17/02/14 08:13:30 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at ip-172-31-10-127.us-west-2.compute.internal/172.31.10.127

************************************************************/
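Formatting erases the NameNode metadata, so it should only be done once on a new cluster. The following sketch guards against an accidental second format by looking for the current/VERSION file that a formatted NameNode directory contains; namenode_is_formatted is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if DIR already holds a formatted NameNode image.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Usage, with the NameNode directory configured in hdfs-site.xml:
#   namenode_is_formatted /home/hadoop/hadoopdata/hdfs/namenode || hdfs namenode -format
```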

Step 5 – Start Hadoop Cluster

Let’s start your Hadoop cluster using the scripts provided by Hadoop. Navigate to your Hadoop sbin directory and execute the scripts one by one.


$ cd $HADOOP_HOME/sbin/

Run start-dfs.sh to start the NameNode, DataNode, and SecondaryNameNode


$ start-dfs.sh

Sample output:

17/02/14 08:16:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

Starting namenodes on [localhost]

localhost: starting namenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-namenode-ip-172-31-10-127.out

localhost: starting datanode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-datanode-ip-172-31-10-127.out

Starting secondary namenodes [0.0.0.0]

The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.

RSA key fingerprint is a2:9b:7c:8f:21:43:6e:ce:18:5e:85:5b:a1:57:d2:99.

Are you sure you want to continue connecting (yes/no)? yes

0.0.0.0: Warning: Permanently added '0.0.0.0' (RSA) to the list of known hosts.

0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop/logs/hadoop-hadoop-secondarynamenode-ip-172-31-10-127.out

17/02/14 08:16:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

Run start-yarn.sh to start the YARN daemons, ResourceManager and NodeManager


$ start-yarn.sh

Sample output:

starting yarn daemons

starting resourcemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-resourcemanager-ip-172-31-10-127.out

localhost: starting nodemanager, logging to /home/hadoop/hadoop/logs/yarn-hadoop-nodemanager-ip-172-31-10-127.out

To check the status of the services, run the jps command.


$ jps

Sample output:

12544 NameNode

13001 ResourceManager

13104 NodeManager

12672 DataNode

13993 Jps

12843 SecondaryNameNode
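Reading the jps list by eye is error-prone, so the check can be scripted. The sketch below reads jps output on standard input and prints any of the five expected daemons that is absent; missing_daemons is a hypothetical helper name.

```shell
# Hypothetical helper: read `jps` output on stdin and print each expected daemon
# that does not appear. Returns non-zero if anything is missing.
missing_daemons() {
    output=$(cat)
    rc=0
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare against the whole second field, so "NameNode" does not
        # match the "SecondaryNameNode" line.
        if ! printf '%s\n' "$output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
#   jps | missing_daemons && echo "all daemons running"
```

If a daemon is reported as missing, its log file under the Hadoop logs directory is the first place to look.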

Step 6 – Check Hadoop Services

Open port 50070 in a browser for information about the NameNode:

http://HOST_NAME:50070/

Open port 8088 for information about the cluster (the YARN ResourceManager UI):

http://HOST_NAME:8088/

Open port 50090 for information about the SecondaryNameNode:

http://HOST_NAME:50090/

Open port 50075 for information about the DataNode:

http://HOST_NAME:50075/
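On a server without a browser, the same four endpoints can be probed from the command line. This is a rough sketch: hadoop_ui_urls and probe_hadoop_uis are hypothetical helper names, and the probe assumes curl is installed and the cluster is running.

```shell
# Hypothetical helper: print the four web UI URLs listed above for a given host.
hadoop_ui_urls() {
    host="${1:-localhost}"
    for port in 50070 8088 50090 50075; do
        echo "http://$host:$port/"
    done
}

# Hypothetical helper: request each URL and print the HTTP status code.
# curl reports 000 when it cannot connect at all.
probe_hadoop_uis() {
    hadoop_ui_urls "$1" | while read -r url; do
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
        echo "$url -> $code"
    done
}

# Usage:
#   probe_hadoop_uis localhost
```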

Step 7 – Test Hadoop Setup

  i) Make the HDFS directories

$ hdfs dfs -mkdir /user

$ hdfs dfs -mkdir /user/hadoop
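With the /user/hadoop directory in place, a simple round trip confirms that HDFS accepts writes and serves reads. The following is a minimal sketch; hdfs_smoke_test and the file name smoke-test.txt are hypothetical, and it assumes the hdfs command is on the PATH and the cluster is running.

```shell
# Hypothetical smoke test: upload a small file to HDFS and read it back.
hdfs_smoke_test() {
    local_file=$(mktemp) || return 1
    echo "hello hadoop" > "$local_file"
    target="/user/hadoop/smoke-test.txt"
    # -f overwrites the target if an earlier run left it behind.
    hdfs dfs -put -f "$local_file" "$target" &&
        hdfs dfs -cat "$target"
    rc=$?
    rm -f "$local_file"
    return "$rc"
}

# Usage:
#   hdfs_smoke_test
```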

Manage Hadoop Services

To start all hadoop instances run the below commands


$ start-dfs.sh

$ start-yarn.sh

To stop all hadoop instances run the below commands


$ stop-yarn.sh

$ stop-dfs.sh
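The start and stop sequences above can be wrapped in one function so the order is always the same: HDFS before YARN on start, and the reverse on stop. This is only a sketch, and hadoop_services is a hypothetical name; it assumes the Hadoop sbin directory is on the PATH as configured in Step 4.

```shell
# Hypothetical wrapper: start or stop the HDFS and YARN daemons in a fixed order.
hadoop_services() {
    case "$1" in
        start)
            # Bring up HDFS first, then YARN.
            start-dfs.sh && start-yarn.sh
            ;;
        stop)
            # Stop in reverse order; attempt both even if the first one fails.
            stop-yarn.sh
            stop-dfs.sh
            ;;
        *)
            echo "usage: hadoop_services {start|stop}" >&2
            return 2
            ;;
    esac
}

# Usage:
#   hadoop_services start
#   hadoop_services stop
```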

We hope this article helped you set up Hadoop 2.8.0 (Single Node Cluster) on CentOS. If you have any doubts or queries, please comment below. For updates, follow agiratechnologies.