Wednesday 21 November 2012

Starfish: Hadoop Performance Tuning Tool

It's been a long time since my last post... a lapse of 3-4 months or so... :( Well, I thought of writing about an awesome tool I was playing with 4 months ago for performance tuning in Hadoop, called “Starfish”.

What is Starfish?
Starfish is a self-tuning system for big data analytics. It's an open source project hosted on GitHub.
GitHub link:
 https://github.com/jwlent55/Starfish


What is the need for Starfish?
Need for Performance!!


What it does and what are its components?
It enables Hadoop users and applications to get good performance automatically.
It has three main components.
1. Profiler
2. What-if Engine
3. Optimizer

1. Job Profile / Profiler :

  1. A profile is a concise statistical summary of an MR job's execution.
  2. Profiling is based on dataflow and cost estimation for the MR job.
  3. Dataflow estimation considers the number of bytes and <K,V> pairs processed during the job's execution.
  4. Cost estimation considers resource usage and execution time at the level of tasks and of the phases within each task.
  5. The performance models combine the above two with the configuration parameters associated with the MR job.
  6. Space of configuration choices (a few concrete Hadoop parameter names are shown after this list):
    • Number of map tasks
    • Number of reduce tasks
    • Partitioning of map outputs to reduce tasks
    • Memory allocation to task-level buffers
    • Multiphase external sorting in the tasks
    • Whether output data from tasks should be compressed
    • Whether the combine function should be used ...
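To make those choices concrete, here are a few of the standard Hadoop 0.20 parameter names behind them, passed as -D overrides to a job run (these are plain Hadoop settings, not Starfish-specific, and the jar name and paths are only illustrative):

    # mapred.reduce.tasks        -> number of reduce tasks
    # io.sort.mb                 -> memory for the map-side sort buffer
    # io.sort.factor             -> streams merged at once in the multiphase sort
    # mapred.compress.map.output -> compress intermediate map output
    hadoop jar hadoop-examples.jar wordcount \
      -D mapred.reduce.tasks=12 \
      -D io.sort.mb=200 \
      -D io.sort.factor=50 \
      -D mapred.compress.map.output=true \
      input output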
job j = <program p, data d, resources r, configuration c>
Thus, performance can be expressed as a function of the job j:
perf = F(p, d, r, c)
A job profile is generated by the Profiler through measurement, or by the What-if Engine through estimation.
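For measurement, Starfish ships wrapper scripts under bin/ that run an ordinary hadoop command with profiling switched on. A minimal sketch of profiling WordCount (the jar name and paths are illustrative; check bin/ and the Starfish tutorial for the exact syntax of your version):

    # run the job through the Starfish Profiler instead of plain "hadoop jar"
    bin/profile hadoop jar hadoop-examples.jar wordcount input output

The collected profiles end up under the PROFILER_OUTPUT_DIR configured in bin/config.sh (see the installation section below).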

2. What-if Engine:
The What-if Engine uses a mix of simulation and model-based estimation at the phase level of MapReduce job execution, in order to predict the performance of a MapReduce job before it is executed on a Hadoop cluster.
It estimates perf using the properties of p, d, r, and c.
i.e. given the profile for job j = <p, d1, r1, c1>,
    estimate the profile for job j' = <p, d2, r2, c2>.
It uses white-box models consisting of a detailed set of equations for Hadoop.
Example:
Input data properties
Dataflow statistics
Configuration parameters
⇒ Calculate the dataflow in each phase of a map task
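A concrete (hypothetical) instance of the j → j' question above:

    Given:  profile of j  = <WordCount, 4.45GB input, 3 slaves, default configuration>
    Ask:    perf of j'    = <WordCount, 4.45GB input, 3 slaves, configuration with 12
                             reduce tasks and map-output compression enabled>
    The engine answers with an estimated profile and running time for j',
    without actually running j' on the cluster.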

3. Optimizer:
It finds the optimal configuration settings to use for executing a MapReduce job. It can simply recommend those settings, or go ahead and run the job with the recommended configuration.
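In practice this is exposed through the same bin/ wrapper scripts; a rough sketch (the script name follows the Starfish tutorial and the paths are illustrative, so double-check against your checkout):

    # let the Optimizer pick the configuration and run the job with it
    bin/execute hadoop jar hadoop-examples.jar wordcount input output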

Benchmark:
Normal Execution:
Program : WordCount
Data Size : 4.45GB
Time taken to complete the job : 8m 5s

Starfish Profiling and Optimized Execution:
Program : WordCount
Data Size: 4.45GB
Time taken to complete the job : 4m 59s

Executed on a cluster of 1 master and 3 slave nodes.
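That works out to 485 s vs. 299 s, i.e. a speedup of roughly 1.6x, or about a 38% reduction in run time, purely from the configuration Starfish recommended.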


What’s achieved?

  • Perform in-depth job analysis with profiles
  • Predict the behavior of hypothetical job executions
  • Optimize arbitrary MapReduce programs


Installation ??
It’s pretty easy to install.

  • Prerequisites:
    • A Hadoop cluster running 0.20.2 or 0.20.203.0 should be up and running (a quick check is sketched below). Tested with the Cloudera distributions.
    • Java JDK should be installed.
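
A quick sanity check for the prerequisites (run on the master; the output of course depends on your setup):

    java -version             # JDK installed?
    hadoop version            # should report 0.20.2 or 0.20.203.0
    hadoop dfsadmin -report   # HDFS up, with the slave datanodes alive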


  • Compile the source code
    • Compile the entire source code and create the jar files:
    ant

    • Execute all available JUnit tests and verify the code was compiled successfully:
    ant test

    • Generate the javadoc documentation in docs/api:
    ant javadoc

Ensure that the JAVA_HOME and HADOOP_HOME environment variables are set in ~/.bashrc.
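For example (the actual paths depend on where your JDK and Hadoop are installed):

    # in ~/.bashrc -- illustrative paths
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    export HADOOP_HOME=/usr/lib/hadoop-0.20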

  • BTrace Installation in the Slave Nodes
After compilation, the newly created btrace directory will contain all the classes and jars. These must be shipped to the slave nodes.

  • Create a file (on the master node) named “slaves_list.txt”
This file should contain the slave nodes' IP addresses or hostnames. Make sure the hostnames are mapped in the master node's /etc/hosts (IP address and the respective slave hostname); see the snippet after the example below.
Example :
$ vi slaves_list.txt
slave1
slave2
slave3
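
And the matching /etc/hosts entries on the master would look something like this (the IP addresses are illustrative):

192.168.1.101   slave1
192.168.1.102   slave2
192.168.1.103   slave3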

  • Set the global profiling parameters in bin/config.sh (an example is sketched after this list)

  • SLAVES_BTRACE_DIR: BTrace installation directory at the slave nodes. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
  • CLUSTER_NAME: A descriptive name for the cluster. Do not include spaces or special characters in the name.
  • PROFILER_OUTPUT_DIR: The local directory to place the collected logs and profile files. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
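
A sketch of how those three might look inside bin/config.sh (the values are illustrative; only the variable names come from Starfish):

    SLAVES_BTRACE_DIR=/usr/local/starfish/btrace
    CLUSTER_NAME=test-cluster
    PROFILER_OUTPUT_DIR=/home/hadoop/starfish-results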

  • Run the script
bin/install_btrace.sh <absolute_path_slaves_list.txt>

  • This will copy the BTrace jars into the SLAVES_BTRACE_DIR on each slave node.
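
A quick way to verify the copy on one of the slaves (hostname and path per the illustrative values above):

    ssh slave1 "ls /usr/local/starfish/btrace"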

That is all for the installation.

Execution comes next.

The link http://www.cs.duke.edu/starfish/tutorial/ is a great source to get started with both installation and execution. The documentation is equally great!
Happy Learning! :)