Wednesday 21 November 2012

Starfish: Hadoop Performance Tuning Tool

It's been a long time since my last post... a lapse of 3-4 months or so... :( Well, I thought of writing about an awesome tool I was playing with 4 months ago for performance tuning in Hadoop, called “Starfish”.

What is Starfish?
Starfish is a self-tuning system for big data analytics. It's an open source project hosted on GitHub.
GitHub link:
 https://github.com/jwlent55/Starfish


What is the need for Starfish?
Need for Performance!!


What it does and what are its components?
It enables Hadoop users and applications to get good performance automatically.
It has three main components.
1. Profiler
2. What-if Engine
3. Optimizer

1. Job Profile / Profiler :

  1. A profile is a concise statistical summary of an MR job's execution.
  2. Profiling is based on dataflow and cost estimation for the MR job.
  3. Dataflow estimation considers the number of bytes and <K,V> pairs processed during the job's execution.
  4. Cost estimation considers resource usage and execution time at the level of tasks and of the phases within each task.
  5. The performance models combine the above two with the configuration parameters associated with the MR job.
  6. Space of configuration choices (a few concrete Hadoop parameter names are shown after this list):
    • Number of map tasks
    • Number of reduce tasks
    • Partitioning of map outputs to reduce tasks
    • Memory allocation to task-level buffers
    • Multiphase external sorting in the tasks
    • Whether output data from tasks should be compressed
    • Whether the combine function should be used ...
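To make those choices concrete, here are a few of the standard Hadoop 0.20 parameter names behind them, passed as -D overrides to a job run (these are plain Hadoop settings, not Starfish-specific, and the jar name and paths are only illustrative):

    # mapred.reduce.tasks        -> number of reduce tasks
    # io.sort.mb                 -> memory for the map-side sort buffer
    # io.sort.factor             -> streams merged at once in the multiphase sort
    # mapred.compress.map.output -> compress intermediate map output
    hadoop jar hadoop-examples.jar wordcount \
      -D mapred.reduce.tasks=12 \
      -D io.sort.mb=200 \
      -D io.sort.factor=50 \
      -D mapred.compress.map.output=true \
      input output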
job j = <program p, data d, resources r, configuration c>
Thus, performance can be expressed as a function of the job j:
perf = F(p, d, r, c)
A job profile is generated by the Profiler through measurement, or by the What-if Engine through estimation.
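For measurement, Starfish ships wrapper scripts under bin/ that run an ordinary hadoop command with profiling switched on. A minimal sketch of profiling WordCount (the jar name and paths are illustrative; check bin/ and the Starfish tutorial for the exact syntax of your version):

    # run the job through the Starfish Profiler instead of plain "hadoop jar"
    bin/profile hadoop jar hadoop-examples.jar wordcount input output

The collected profiles end up under the PROFILER_OUTPUT_DIR configured in bin/config.sh (see the installation section below).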

2. What-if Engine:
The What-if Engine uses a mix of simulation and model-based estimation at the phase level of MapReduce job execution, in order to predict the performance of a MapReduce job before it is executed on a Hadoop cluster.
It estimates perf using the properties of p, d, r, and c.
i.e. given the profile for job j = <p, d1, r1, c1>,
    estimate the profile for job j' = <p, d2, r2, c2>.
It uses white-box models consisting of a detailed set of equations for Hadoop.
Example:
Input data properties
Dataflow statistics
Configuration parameters
⇒ Calculate the dataflow in each phase of a map task
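A concrete (hypothetical) instance of the j → j' question above:

    Given:  profile of j  = <WordCount, 4.45GB input, 3 slaves, default configuration>
    Ask:    perf of j'    = <WordCount, 4.45GB input, 3 slaves, configuration with 12
                             reduce tasks and map-output compression enabled>
    The engine answers with an estimated profile and running time for j',
    without actually running j' on the cluster.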

3. Optimizer:
It finds the optimal configuration settings to use for executing a MapReduce job. It can simply recommend those settings, or go ahead and run the job with the recommended configuration.
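In practice this is exposed through the same bin/ wrapper scripts; a rough sketch (the script name follows the Starfish tutorial and the paths are illustrative, so double-check against your checkout):

    # let the Optimizer pick the configuration and run the job with it
    bin/execute hadoop jar hadoop-examples.jar wordcount input output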

Benchmark:
Normal Execution:
Program : WordCount
Data Size : 4.45GB
Time taken to complete the job : 8m 5s

Starfish Profiling and Optimized Execution:
Program : WordCount
Data Size: 4.45GB
Time taken to complete the job : 4m 59s

Executed on a cluster of 1 master and 3 slave nodes.
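That works out to 485 s vs. 299 s, i.e. a speedup of roughly 1.6x, or about a 38% reduction in run time, purely from the configuration Starfish recommended.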


What’s achieved?

  • Perform in-depth job analysis with profiles
  • Predict the behavior of hypothetical job executions
  • Optimize arbitrary MapReduce programs


Installation ??
It’s pretty easy to install.

  • Prerequisites:
    • A Hadoop cluster running 0.20.2 or 0.20.203.0 should be up and running (a quick check is sketched below). Tested with the Cloudera distributions.
    • Java JDK should be installed.
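
A quick sanity check for the prerequisites (run on the master; the output of course depends on your setup):

    java -version             # JDK installed?
    hadoop version            # should report 0.20.2 or 0.20.203.0
    hadoop dfsadmin -report   # HDFS up, with the slave datanodes alive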


  • Compile the source code
    • Compile the entire source code and create the jar files:
    ant

    • Execute all available JUnit tests and verify the code was compiled successfully:
    ant test

    • Generate the javadoc documentation in docs/api:
    ant javadoc

Ensure that the JAVA_HOME and HADOOP_HOME environment variables are set in ~/.bashrc.
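For example (the actual paths depend on where your JDK and Hadoop are installed):

    # in ~/.bashrc -- illustrative paths
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    export HADOOP_HOME=/usr/lib/hadoop-0.20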

  • BTrace Installation in the Slave Nodes
After compilation, the newly created btrace directory will contain all the classes and jars. These must be shipped to the slave nodes.

  • Create a file (on the master node) named “slaves_list.txt”
This file should contain the slave nodes' IP addresses or hostnames. Make sure the hostnames are mapped in the master node's /etc/hosts (IP address and the respective slave hostname); see the snippet after the example below.
Example :
$ vi slaves_list.txt
slave1
slave2
slave3
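
And the matching /etc/hosts entries on the master would look something like this (the IP addresses are illustrative):

192.168.1.101   slave1
192.168.1.102   slave2
192.168.1.103   slave3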

  • Set the global profiling parameters in bin/config.sh (an example is sketched after this list)

  • SLAVES_BTRACE_DIR: BTrace installation directory at the slave nodes. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
  • CLUSTER_NAME: A descriptive name for the cluster. Do not include spaces or special characters in the name.
  • PROFILER_OUTPUT_DIR: The local directory to place the collected logs and profile files. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
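
A sketch of how those three might look inside bin/config.sh (the values are illustrative; only the variable names come from Starfish):

    SLAVES_BTRACE_DIR=/usr/local/starfish/btrace
    CLUSTER_NAME=test-cluster
    PROFILER_OUTPUT_DIR=/home/hadoop/starfish-results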

  • Run the script
bin/install_btrace.sh <absolute_path_slaves_list.txt>

  • This will copy the BTrace jars into the SLAVES_BTRACE_DIR on each slave node.
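
A quick way to verify the copy on one of the slaves (hostname and path per the illustrative values above):

    ssh slave1 "ls /usr/local/starfish/btrace"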

That is all for the installation.

Execution comes next.

The link http://www.cs.duke.edu/starfish/tutorial/ is a great source to get started with both installation and execution. The documentation is equally great!
Happy Learning! :)