* Techie(S)pArK *: configuration

This post is about how to launch a CDH4 MRv1 or CDH4 YARN cluster on EC2 instances. It's said that you can launch a cluster with the help of Whirr in a matter of 5 minutes! This is very true if and only if everything works out well! ;)

Hopefully, this article helps you in that regard.
So, let's row the boat...

Download the stable version of Apache Whirr, i.e. whirr-0.8.1.tar.gz.
Extract the tarball and generate the key.

Generate the key
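The original command did not survive in this copy of the post. As a minimal sketch, a passwordless RSA key pair can be generated with ssh-keygen. The function name generate_whirr_key and the file name id_rsa_whirr are assumptions made for illustration.

```shell
generate_whirr_key() {
  # $1: path of the private key to create (the public key gets a .pub suffix)
  key_file="$1"
  mkdir -p "$(dirname "$key_file")"
  # -P '' sets an empty passphrase so the key can be used unattended;
  # an existing key at the same path is left untouched
  [ -f "$key_file" ] || ssh-keygen -q -t rsa -P '' -f "$key_file"
}

# Usage: generate_whirr_key "$HOME/.ssh/id_rsa_whirr"
```

Keeping a separate key just for Whirr means your everyday SSH key is never copied into a cluster configuration.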

Create a properties file that describes the cluster you want to launch.
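The properties file from the original post is not reproduced here. The sketch below writes a minimal MRv1 file modeled on the CDH recipes that ship with Whirr. The cluster name, node counts and key paths are all assumptions, so adjust them to your setup.

```shell
# Write a minimal properties file. Every value is an assumption for
# illustration, not the file from the original post.
cat > whirr_cdh.properties <<'EOF'
# Lowercase letters only
whirr.cluster-name=hadoopcluster
# One master plus three workers (MRv1 roles)
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
# AWS credentials are read from environment variables
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa_whirr
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa_whirr.pub
# Install Java on the nodes, then CDH4 Hadoop
whirr.java.install-function=install_openjdk
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.env.repo=cdh4
EOF
```

The quoted 'EOF' keeps the shell from expanding the ${...} placeholders, so Whirr resolves them itself at launch time.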

Now let me tell you how to avoid getting headaches!

Cluster name: Keep your cluster name simple. Avoid names like testCluster or testCluster1, i.e. no capitals or numerals.
Decide judiciously on the number of datanodes you want.
Your launch may not succeed if Java is not installed. Make sure the image has Java. However, this properties file takes care of that.
It is better to go ahead with MRv1 for now and switch to MRv2 later, when we get a production-stable release.
This is the minimal set of configurations for launching a Hadoop cluster, but you can do a lot of performance tuning on top of it.
I launched this cluster from an EC2 instance. Initially I faced errors regarding the user. Setting the configuration below solved the problem.

Set proper permissions for ~/.ssh and whirr-0.8.1 folder before launching.
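The post does not say which modes it used. A minimal sketch, assuming the conventional OpenSSH permissions and a hypothetical helper named secure_whirr_paths, could look like this:

```shell
secure_whirr_paths() {
  # $1: ssh directory, $2: private key file, $3: extracted Whirr directory
  chmod 700 "$1"        # only the owner may enter the ssh directory
  chmod 600 "$2"        # private key readable and writable by the owner only
  chmod -R u+rwX "$3"   # owner can read, write and traverse the Whirr folder
}

# Usage:
# secure_whirr_paths "$HOME/.ssh" "$HOME/.ssh/id_rsa_whirr" "$HOME/whirr-0.8.1"
```

SSH refuses to use a private key that other users can read, which is why the key file is restricted to its owner.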

Well, we are ready to launch the cluster. Name the properties file "whirr_cdh.properties".
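The launch goes through Whirr's launch-cluster command. The sketch below wraps it in a small function. WHIRR_HOME and the function name launch_cluster are assumptions, with WHIRR_HOME expected to point at the extracted whirr-0.8.1 directory.

```shell
# Assumption: WHIRR_HOME points at the extracted whirr-0.8.1 directory.
WHIRR_HOME="${WHIRR_HOME:-$HOME/whirr-0.8.1}"

launch_cluster() {
  # $1: path to the properties file describing the cluster
  "$WHIRR_HOME/bin/whirr" launch-cluster --config "$1"
}

# Usage: launch_cluster whirr_cdh.properties
```

When you are done with the cluster, the matching destroy-cluster command takes the same --config argument and shuts the instances down.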

In the console you can see links to the NameNode and JobTracker web UIs. At the end, it also prints how to ssh to the instances.

Now you should have the generated files. You will be able to see these files: instances, hadoop-proxy.sh and hadoop-site.xml
Starting the proxy

Open another terminal, and type
You should be able to access the HDFS.
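The exact commands were lost from this copy. A minimal sketch, assuming Whirr wrote the generated files to ~/.whirr/<cluster-name> and that a local hadoop client is on the PATH, is shown here. The function names and the cluster name hadoopcluster are illustrative.

```shell
# Assumption: the cluster name matches whirr.cluster-name in the properties file.
CLUSTER_NAME="${CLUSTER_NAME:-hadoopcluster}"
CLUSTER_DIR="$HOME/.whirr/$CLUSTER_NAME"

start_proxy() {
  # Runs in the foreground; keep this terminal open while you use the cluster
  sh "$CLUSTER_DIR/hadoop-proxy.sh"
}

list_hdfs_root() {
  # Point the local client at the generated hadoop-site.xml, then list HDFS
  HADOOP_CONF_DIR="$CLUSTER_DIR" hadoop fs -ls /
}

# Terminal 1: start_proxy
# Terminal 2: list_hdfs_root
```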

Alternatively, you can download the Hadoop tarball and launch with

Okay! So I know that you will not be satisfied unless you see a web UI

So, we are good to go!

If you want to launch MRv2, use this.

and the same process!
Happy Learning! :)

It's been a long time since I last blogged... a lapse of 3-4 months or so... :( Well, I thought of writing about an awesome tool for performance tuning in Hadoop that I was playing with 4 months ago, called “Starfish”.

What is Starfish?
Starfish is a self-tuning system for Big Data analytics. It's an open source project hosted on GitHub.
Github Link: https://github.com/jwlent55/Starfish

What is the need for Starfish?
Need for Performance!!

What it does and what are its components?
It enables Hadoop users and applications to get good performance automatically.
It has three main components.
1. Profiler
2. What-if Engine
3. Optimizer

1. Job Profile / Profiler :

A profile is a concise statistical summary of an MR job execution.
Profiling is based on the dataflow and cost estimation of the MR job.
Dataflow estimation considers the number of bytes of <K,V> pairs processed during a job’s execution.
Cost estimation considers execution time at the level of tasks and of phases within the tasks of an MR job execution (basically, the resource usage and execution time).
The performance models consider the above two together with the configuration parameters associated with the MR job.
Space of configuration choices:

Number of map tasks
Number of reduce tasks
Partitioning of map outputs to reduce tasks
Memory allocation to task-level buffers
Multiphase external sorting in the tasks
Whether output data from tasks should be compressed
Whether combine function should be used ...

job j = < program p, data d, resources r, configuration c >
Thus, we can say that performance is a function of a job j:
perf = F(p,d,r,c)
A job profile is generated by the Profiler through measurement or by the What-if Engine through estimation.

2. What-if Engine:
The What-if Engine uses a mix of simulation and model-based estimation at the phase level of MapReduce job execution in order to predict the performance of a MapReduce job before it is executed on a Hadoop cluster.
It estimates perf using the properties of p, d, r, and c.
i.e. Given the profile for job j = <p, d1, r1, c1>
Estimate the profile for job j' = <p, d2, r2, c2>
It has white-box models consisting of a detailed set of equations for Hadoop.
Example:
Input data properties
Dataflow statistics
Configuration parameters
⇒ Calculate dataflow in each task phase in a map task

3. Optimizer:
It finds the optimal configuration settings to use for executing a MapReduce job. It recommends those settings and can also run the job with them.

Benchmark:
Normal Execution:
Program : WordCount
Data Size : 4.45GB
Time taken to complete the job : 8m 5s

Starfish Profiling and Optimized Execution:
Program : WordCount
Data Size: 4.45GB
Time taken to complete the job : 4m 59s

Both runs were executed on a cluster of 1 master and 3 slave nodes.

What’s achieved?

Perform in-depth job analysis with profiles

Predict the behavior of hypothetical job executions

Optimize arbitrary MapReduce programs

Installation ??
It’s pretty easy to install.

Prerequisites :

A Hadoop cluster running version 0.20.2 or 0.20.203.0 should be up and running. Tested with Cloudera distributions.

Java JDK should be installed.

Download from the repository

git clone from the repository https://github.com/hherodotou/Starfish

or download the tarball from here http://www.cs.duke.edu/starfish/files/starfish-0.3.0.tar.gz

Compile the source code

Compile the entire source code and create the jar files:

ant

Execute all available JUnit tests and verify the code was compiled successfully:

ant test

Generate the javadoc documentation in docs/api:

ant javadoc

Ensure that in ~/.bashrc,

JAVA_HOME and HADOOP_HOME environment variables are set.
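As a minimal sketch, the lines in ~/.bashrc might look like the following. Both paths are placeholders chosen for illustration, so replace them with the actual install locations on your machine.

```shell
# Assumption: both default paths are placeholders, not Starfish requirements.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-6-openjdk}"
export HADOOP_HOME="${HADOOP_HOME:-/usr/lib/hadoop}"
# Make the java and hadoop commands available on the PATH
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin"
```

Run `source ~/.bashrc` or open a new terminal afterwards so the variables take effect.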

BTrace Installation in the Slave Nodes

After compilation, the btrace directory that gets created will contain all the classes and jars. These must be shipped to the slave nodes.

Create a file (in Master node) “slaves_list.txt”

This file should contain the slave nodes' IP addresses or hostnames. If you use hostnames, make sure they are listed in the master node's /etc/hosts (each IP address with its respective slave hostname).

Example :

$ vi slaves_list.txt

slave1

slave2

slave3

Set the global profile parameters in bin/config.sh

SLAVES_BTRACE_DIR: BTrace installation directory at the slave nodes. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.

CLUSTER_NAME: A descriptive name for the cluster. Do not include spaces or special characters in the name.

PROFILER_OUTPUT_DIR: The local directory to place the collected logs and profile files. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
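For illustration, the three parameters could be filled in as below, assuming bin/config.sh takes plain shell assignments. All three values are hypothetical, so pick paths that your Hadoop user can write to.

```shell
# Assumption: every value here is a placeholder for illustration.
# Full path on each slave node where the BTrace jars will be copied
SLAVES_BTRACE_DIR=/home/hadoop/btrace
# No spaces or special characters
CLUSTER_NAME=democluster
# Full local path for the collected logs and profile files
PROFILER_OUTPUT_DIR=/home/hadoop/starfish/results
```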

Run the script

bin/install_btrace.sh <absolute_path_slaves_list.txt>

This will copy the btrace jars to the SLAVES_BTRACE_DIR on the slave nodes.

That is all for the installation.

Execution then proceeds through these stages:

Profiling

Job Analysis

What-if analysis

Optimisation

Follow the execution steps from this link: http://www.cs.duke.edu/starfish/tutorial/profile.html

The link http://www.cs.duke.edu/starfish/tutorial/ is a great source to get started with both installation and execution. The documentation is equally great!
Happy Learning! :)

Monday, 11 February 2013

Hadoop Hangover: How-to launch a hadoop cluster CDH4 [MRv1 / YARN + Ganglia] using Apache Whirr

Wednesday, 21 November 2012

Starfish : Hadoop Performance Tuning Tool