Tuesday, 18 December 2012

FUSE on Amazon S3

FUSE (Filesystem in Userspace) is a well-known open source project hosted on SourceForge: http://fuse.sourceforge.net/
You can put files either directly into the S3 bucket or into the mount point; both will always share the same hierarchy and stay in sync. The best thing is that any arbitrary program can simply point to this mount point and perform normal file commands, rather than S3-specific ones.

Here is a short write-up on how we can achieve this.

1. Check out the code from Google Code.
$ svn checkout http://s3fs.googlecode.com/svn/trunk/ s3fs

2. Switch to the working directory
$ cd s3fs
$ ls 
AUTHORS  autogen.sh  ChangeLog  configure.ac  COPYING  doc  INSTALL  Makefile.am  NEWS  README  src  test

3. Now the same old ritual of configure, make and install.
To run the next command you need autoconf, so make sure you have it installed:
$ sudo apt-get install autoconf
$ autoreconf --install 
If it complains, you are most likely missing some development libraries. Time to get them installed...
$ sudo apt-get install build-essential libfuse-dev fuse-utils libcurl4-openssl-dev libxml2-dev mime-support

Getting back...
$ ./configure --prefix=/usr
$ make
$ sudo make install

4. Done with the installation process.
Cross-check:
$ /usr/bin/s3fs  
s3fs: missing BUCKET argument
Usage: s3fs BUCKET:[PATH] MOUNTPOINT [OPTION]...

5. Add the following line to your ~/.bashrc file and source it.
export s3fs=/usr/bin/s3fs
$ source ~/.bashrc
$ s3fs
s3fs: missing BUCKET argument
Usage: s3fs BUCKET:[PATH] MOUNTPOINT [OPTION]...

6. Install s3cmd. Many of you must be using this tool to interact with s3.
$ sudo apt-get install s3cmd
$ s3cmd --configure
This will configure access to your S3 account using the Access Key and Secret Key.

Configuring FUSE
1. First, allow non-root users to access the mount by uncommenting user_allow_other in fuse.conf.
$ sudo vi /etc/fuse.conf

2. Put your AccessKey:SecretKey pair, in that format, into the passwd-s3fs file.
$ sudo vi /etc/passwd-s3fs
$ sudo chmod 640 /etc/passwd-s3fs
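The file holds the credentials as AccessKey:SecretKey, one pair per line (the value below is just a placeholder; a line can also be prefixed with a bucket name for per-bucket keys):
AKIAXXXXXXXXXXXXXXXX:aBcDeFgHiJkLmNoPqRsTuVwXyZ0123456789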
3. Created a bucket called "s3dir-sync" for this experiment.
$ s3cmd ls
2012-12-18 09:23  s3://s3dir-sync
4. Create a mount point where you want to place the files and keep them in sync with the S3 bucket. Create it as the root user.
$ sudo mkdir -p /mnt/s3Sync
$ sudo chmod 777 /mnt/s3Sync

5. Mount the bucket with s3fs, as the root user.
$ sudo s3fs s3dir-sync -o default_acl=public-read -o allow_other /mnt/s3Sync/
Cross-check:
$ mount -l
s3fs on /mnt/s3Sync type fuse.s3fs (rw,nosuid,nodev,allow_other)
If you try mounting again, you will get the following warning:
mount: according to mtab, s3fs is already mounted on /mnt/s3Sync
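
If you would like the bucket mounted automatically at boot, an /etc/fstab entry along these lines should work (a sketch based on the classic s3fs syntax of this era; options may differ across s3fs versions):
s3fs#s3dir-sync /mnt/s3Sync fuse allow_other,default_acl=public-read 0 0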

6. I created a directory structure /mnt/s3Sync/2012/12/18 with a test file in it:
$ more test.txt
This is a check file to sync with the s3dir-sync.
Blah..!

The same hierarchy is synced into the bucket "s3dir-sync".
Cross-Check: 
$ s3cmd ls s3://s3dir-sync
DIR   s3://s3dir-sync/2012/
2012-12-18 09:57         0   s3://s3dir-sync/2012

Happy Learning! :)

Monday, 10 December 2012

Get your wireless working on DELL Inspiron 5220


I bought a new DELL Inspiron 5220. It's amazing!
Configuration :

  • 3rd Generation i5 Processor
  • 4GB RAM
  • 1TB Hard Disk
  • 15" Screen
  • 1GB Graphics

It ships with Windows 8! ;)
However, I set up a dual boot on it, although the BIOS looked different this time!!

Well,
I'm working on Ubuntu :)
Release : 11.10 (Oneiric)
Kernel Linux : 3.0.0-28-generic
GNOME 3.2.1

But Wi-Fi was not getting detected. This was not unusual, as I had set this up in earlier Dell models.
Well, the remedy is easy.
Step 1: Make sure that you can see the device card and its ID, especially the network controller!
Type in the following command.

$ lspci -nnk | grep Network
08:00.0 Network controller [0280]: Intel Corporation Device [8086:0887] (rev c4)


Step 2: Figure out which kernel version you are running. This matters because the driver we will be installing works on 2.6.37 or higher.

$ uname -a
Linux Swathi 3.0.0-28-generic #45-Ubuntu SMP Wed Nov 14 21:57:26 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux


Step 3: Install the network manager

$ sudo apt-get install network-manager*

Step 4: Install a few packages

$ sudo apt-get install build-essential linux-headers-$(uname -r)
Step 5: Check the output of
    $ dmesg
If it reports a failure to load a firmware file, then it's time to download the matching .ucode file and place it in /lib/firmware.
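A quick way to narrow the output down (just a sketch; the exact wording of the error varies by driver):

$ dmesg | grep -i firmware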
Reboot
It should be working. If not, try Step 6.

Step 6: Download the compat wireless tarball from this location
http://linuxwireless.org/download/compat-wireless-2.6/compat-wireless-2012-05-10-p.tar.bz2

Extract from the tarball

$ tar -xvf <path_to_compat_wireless_bz2>
$ cd <extracted_path_compat_wireless>


Building and installing the driver

$ make
$ sudo make install

After this command, the console will show the commands to unload the existing Bluetooth, Ethernet and Wi-Fi modules. Type in those 3 commands.
Then load the new module into the kernel.

$ sudo modprobe alx
The Ethernet LAN should be detected.
Add the driver module to the file /etc/modules.
Append the following lines and don't touch the rest. This loads the module at every boot, so it stays enabled after a restart.

$ sudo vi /etc/modules
#E2200 support
alx

Reboot your machine.
You should see the "Wi-Fi Networks Available!" notification on your desktop :)
Happy Learning! :)

Wednesday, 21 November 2012

Starfish : Hadoop Performance Tuning Tool

It's been a long time since I blogged... a lapse of 3-4 months or so... :( Well, I thought of writing about an awesome tool I was playing with 4 months ago for performance tuning in Hadoop, called “Starfish”.

What is Starfish?
Starfish is a Self-tuning System for Big Data Analytics. It's an open source project hosted on GitHub.
Github Link:
 https://github.com/jwlent55/Starfish


What is the need for Starfish?
Need for Performance!!


What it does and what are its components?
It enables Hadoop users and applications to get good performance automatically.
It has three main components.
1. Profiler
2. What-if Engine
3. Optimizer

1. Job Profile / Profiler :

  1. Profile is a concise statistical summary of MR Job execution.
  2. This profiling is based on the dataflow and cost estimation of the MR job.
  3. Dataflow estimation considers the number of bytes of <K,V> pairs processed during a job’s execution.
  4. Cost estimation considers the execution time at the level of tasks and the phases within the tasks of an MR job execution (basically, resource usage and execution time).
  5. The performance models consider the above two and the configuration parameters associated with the MR Job.
  6. Space of configuration choices:
    • Number of map tasks
    • Number of reduce tasks
    • Partitioning of map outputs to reduce tasks
    • Memory allocation to task-level buffers
    • Multiphase external sorting in the tasks
    • Whether output data from tasks should be compressed
    • Whether combine function should be used ...
job j = < program p, data d, resources r, configuration c >
Thus, we can say performance is a function of the job j:
perf = F(p,d,r,c)
Job profile is generated by Profiler through measurement or by the What-if Engine through estimation.

2. What-if Engine:
The What-if Engine uses a mix of simulation and model-based estimation at the phase level of MapReduce job execution, in order to predict the performance of a MapReduce job before it is executed on a Hadoop cluster.
It estimates the perf using properties of p, d, r, and c.
ie. Given profile for job j = <p, d1, r1, c1>
    Estimate profile for job j' = <p, d2, r2, c2>
It has white-box models consisting of a detailed set of equations for Hadoop.
Example:
Input data properties
Dataflow statistics
Configuration parameters
⇒ Calculate dataflow in each task phase in a map task

3. Optimizer:
It finds the optimal configuration settings to use for executing a MapReduce job. It recommends them, and can also run the job with the recommended configuration settings.

Benchmark:
Normal Execution:
Program : WordCount
Data Size : 4.45GB
Time taken to complete the job : 8m 5s

Starfish Profiling and Optimized Execution:
Program : WordCount
Data Size: 4.45GB
Time taken to complete the job : 4m 59s

Executed with cluster of 1 Master, 3 Slave nodes


What’s achieved?

  • Perform in-depth job analysis with profiles
  • Predict the behavior of hypothetical job executions
  • Optimize arbitrary MapReduce programs


Installation ??
It’s pretty easy to install.

  • Prerequisites :
    • A Hadoop cluster running 0.20.2 or 0.20.203.0 should be up and running. Tested on Cloudera distributions.
    • Java JDK should be installed.


  • Compile the source code
    • Compile the entire source code and create the jar files:
    ant

    • Execute all available JUnit tests and verify the code was compiled successfully:
    ant test

    • Generate the javadoc documentation in docs/api:
    ant javadoc

Ensure that in ~/.bashrc,
JAVA_HOME and HADOOP_HOME environment variables are set.

  • BTrace Installation in the Slave Nodes
After the compilation, the btrace directory created will contain all the classes and jars. These must be shipped to the slave nodes.

  • Create a file "slaves_list.txt" (on the Master node)
This file should contain the slave nodes' IP addresses or hostnames. Make sure the hostnames are mapped in the Master node's /etc/hosts (IP address and the respective slave hostname).
Example :
$ vi slaves_list.txt
slave1
slave2
slave3

  • Set the global profile parameters in bin/config.sh (a sketch of these settings follows this list)

  • SLAVES_BTRACE_DIR: BTrace installation directory at the slave nodes. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.
  • CLUSTER_NAME: A descriptive name for the cluster. Do not include spaces or special characters in the name.
  • PROFILER_OUTPUT_DIR: The local directory to place the collected logs and profile files. Please specify the full path and ensure you have the appropriate write permissions. The path will be created if it doesn't exist.

  • Run the script
bin/install_btrace.sh <absolute_path_slaves_list.txt>

  • This will copy the btrace jars in the SLAVES_BTRACE_DIR of the slave nodes.
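As an illustration only (the directory paths and cluster name below are placeholders, not values from this setup), the three settings in bin/config.sh might look like:

SLAVES_BTRACE_DIR=/usr/local/btrace
CLUSTER_NAME=HadoopTestCluster
PROFILER_OUTPUT_DIR=/home/hduser/starfish-results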

This is all with the installation.

Execution comes next.

The link http://www.cs.duke.edu/starfish/tutorial/ is a great source to get started with both installation and execution. The documentation is equally great!
Happy Learning! :)

Wednesday, 8 August 2012

Tweak Ganglia With Hadoop

    It's very important to monitor all the machines in the cluster in terms of OS health, bottlenecks, performance hits and so on. There are numerous tools available which spit out a huge number of graphs and statistics. But to the administrator of the cluster, only the prominent stats which seriously affect the performance or health of the cluster should be portrayed.
                        Ganglia fits the bill.
Introduction to Ganglia:
    Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency.
    The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on over 500 clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.

Having learnt a bit about what Ganglia is, there are some basic terms to know before the kick-start.
1. Node : Generally, a single machine with a (1-4) core processor, targeted to perform a single task.
2. Cluster : A cluster consists of a group of nodes.
3. Grid : A grid consists of a group of clusters.

Heading towards Ganglia...
    The Ganglia system comprises two unique daemons, a PHP-based web frontend and a few other small utility programs.

- Ganglia Monitoring Daemon(gmond)
    Gmond is a multi-threaded daemon which runs on each cluster node you want to monitor. It's responsible for monitoring changes in host state, announcing relevant changes, listening to the state of all other Ganglia nodes via a unicast or multicast channel, and answering requests for an XML description of the cluster state. Each gmond transmits information in two different ways: unicasting/multicasting host state in external data representation (XDR) format using UDP messages, or sending XML over a TCP connection.
 
- Ganglia Meta Daemon(gmetad)
    Ganglia Meta Daemon ("gmetad") periodically polls a collection of child data sources, parses the collected XML, saves all numeric, volatile metrics to round-robin databases and exports the aggregated XML over TCP sockets to clients. Data sources may be either "gmond" daemons, representing specific clusters, or other "gmetad" daemons, representing sets of clusters. Data sources use source IP addresses for access control and can be specified using multiple IP addresses for failover. The latter capability is natural for aggregating data from clusters since each "gmond" daemon contains the entire state of its cluster.
 
- Ganglia PHP Web Frontend
    The Ganglia web frontend provides a view of the gathered information via real-time dynamic web pages to the system administrators.

Setting up Ganglia:
    Assume the following:
- Cluster : "HadoopCluster"
- Nodes : "Master", "Slave1", "Slave2". {Considered only 3 nodes for examples. Similarly many nodes/slaves can be configured.}
- Grid : "Grid" consists of "HadoopCluster" for now.

    gmond is supposed to be installed on all the nodes, i.e. "Master", "Slave1" and "Slave2"; gmetad and the web frontend will be on Master. On the Master node, we can see all the statistics in the web frontend. However, we could also have a dedicated server for the web frontend.

Step 1: Install gmond on "Master", "Slave1" and "Slave2"
    Installing Ganglia can be done by downloading the tar.gz, extracting it, and running configure, make and install. But why reinvent the wheel? Let's install it from the repository instead.

OS: Ubuntu 11.04
Ganglia version 3.1.7
Hadoop version CDH3 hadoop-0.20.2

Update your repository packages.
$ sudo apt-get update

Installing depending packages.
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev

Installing gmond.
$  sudo apt-get install ganglia-monitor

Making changes in /etc/ganglia/gmond.conf
$  sudo vi /etc/ganglia/gmond.conf

/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {                   
  daemonize = yes             
  setuid = no
  user = ganglia             
  debug_level = 0              
  max_udp_msg_len = 1472       
  mute = no            
  deaf = no            
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no            
  send_metadata_interval = 0    
}

/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "HadoopCluster"
  owner = "Master"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = Master
  port = 8650
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
#  mcast_join = 239.2.11.71
  port = 8650
#  bind = 239.2.11.71
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
} ...
About the configuration changes:
    Check the globals once. In the cluster {} section, change the cluster name from unspecified to "HadoopCluster" as assumed, and the owner to "Master" (this can be your organisation/admin name); latlong and url can be set according to your location, and there is no harm in leaving them unspecified.
    As said, gmond communicates either by unicasting/multicasting UDP messages or by sending XML over a TCP connection. So, let's get this clear!

udp_send_channel {
  host = Master
  port = 8650
  ttl = 1
...
means that the host receiving at the other end point will be "Master" (where Master is the hostname with an associated IP address; add all the hostnames to /etc/hosts with their respective IPs). The port at which it accepts messages is 8650.
Since gmond is configured this way on "Master", the UDP receive channel is also 8650.

udp_recv_channel {
#  mcast_join = 239.2.11.71
  port = 8650
#  bind = 239.2.11.71
}
The XML description of the cluster state, which includes Hadoop metrics, system metrics etc., is served over TCP port 8649:
tcp_accept_channel {
  port = 8649
} ..

Starting Ganglia:
$ sudo /etc/init.d/ganglia-monitor start
$ telnet Master 8649
Output should contain XML format.

This means gmond is up.
$ ps aux | grep gmond
shows gmond

Installing gmond on the Slave machines is the same, with gmond.conf being:

/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = no
  user = ganglia
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  send_metadata_interval = 0
}

/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "HadoopCluster"
  owner = "Master"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = Master
  port = 8650
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
#  mcast_join = 239.2.11.71
  port = 8650
#  bind = 239.2.11.71
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}....


Starting Ganglia:
$ sudo /etc/init.d/ganglia-monitor start
$ telnet Slave1 8649
Output should contain XML format


This means gmond is up.
$ ps aux | grep gmond
shows gmond

Step 2 : Installing gmetad on Master.
$ sudo apt-get install ganglia-webfrontend

Installing the dependencies
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev librrd2-dev

Making the required changes in gmetad.conf
data_source "HadoopCluster"  Master
gridname "Grid"
setuid_username "ganglia"

data_source specifies the cluster name as "HadoopCluster" and Master as the sole point for consolidating all the metrics and statistics.
gridname is assumed as "Grid" initially.
username is ganglia.

Check for the /var/lib/ganglia directory. If it doesn't exist:
$ sudo mkdir /var/lib/ganglia
$ sudo mkdir /var/lib/ganglia/rrds/
and then
$ sudo chown -R ganglia:ganglia /var/lib/ganglia/

Running the gmetad
$ sudo /etc/init.d/gmetad start

You can stop with
$ sudo /etc/init.d/gmetad stop
and run in debugging mode once.
$ gmetad -d 1

Now, restart the daemon
$ sudo /etc/init.d/gmetad restart

Step 3 : Installing PHP Web-frontend dependent packages at "Master"
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev librrd2-dev

Check for the /var/www/ganglia directory. If it doesn't exist, copy the contents of /usr/share/ganglia-webfrontend to /var/www/ganglia.
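A sketch of that copy (assuming the package placed the frontend under /usr/share/ganglia-webfrontend, as above):
$ sudo cp -r /usr/share/ganglia-webfrontend /var/www/ganglia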

You should be able to see graph.d and many other .php files there. Restart apache2.
$ sudo /etc/init.d/apache2 restart

Time to hit Web URL
http://Master/ganglia
in general, http://<hostname>/ganglia/

You must be able to see some graphs.


Step 4 : Configuring Hadoop-metrics with Ganglia.
On Master( Namenode, JobTracker)
$ sudo vi /etc/hadoop-0.20/conf/hadoop-metrics.properties

# Configuration of the "dfs" context for null
#dfs.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "dfs" context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the "dfs" context for ganglia
 dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 dfs.period=10
 dfs.servers=Master:8650

# Configuration of the "dfs" context for /metrics
#dfs.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext


# Configuration of the "mapred" context for null
#mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "mapred" context for /metrics
mapred.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "mapred" context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the "mapred" context for ganglia
 mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 mapred.period=10
 mapred.servers=Master:8650


# Configuration of the "jvm" context for null
#jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "jvm" context for /metrics
jvm.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "jvm" context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the "jvm" context for ganglia
 jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 jvm.period=10
 jvm.servers=Master:8650

# Configuration of the "rpc" context for null
#rpc.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "rpc" context for /metrics
rpc.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "rpc" context for file
#rpc.class=org.apache.hadoop.metrics.file.FileContext
#rpc.period=10
#rpc.fileName=/tmp/rpcmetrics.log

# Configuration of the "rpc" context for ganglia
 rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 rpc.period=10
 rpc.servers=Master:8650

Hadoop exposes all these metrics through the GangliaContext31 class.
Restart hadoop services at Master.
Restart gmond and gmetad.
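For reference, on a CDH3-style install the restarts might look like this (the hadoop-0.20-* service names are an assumption, following the same pattern used for the slaves later in this post):
$ sudo service hadoop-0.20-namenode restart
$ sudo service hadoop-0.20-jobtracker restart
$ sudo /etc/init.d/ganglia-monitor restart
$ sudo /etc/init.d/gmetad restart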
$ telnet Master 8649
will spit out the Hadoop metrics as XML

On Slave1 (Secondary Namenode, Datanode, TaskTracker)
$ sudo gedit /etc/hadoop-0.20/conf/hadoop-metrics.properties

# Configuration of the "dfs" context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the "dfs" context for ganglia
 dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 dfs.period=10
 dfs.servers=Master:8650

# Configuration of the "dfs" context for /metrics
#dfs.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext


# Configuration of the "mapred" context for null
#mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "mapred" context for /metrics
 mapred.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "mapred" context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the "mapred" context for ganglia
 mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 mapred.period=10
 mapred.servers=Master:8650

# Configuration of the "jvm" context for null
#jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "jvm" context for /metrics
jvm.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "jvm" context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the "jvm" context for ganglia
 jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 jvm.period=10
 jvm.servers=Master:8650

# Configuration of the "rpc" context for null
#rpc.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "rpc" context for /metrics
rpc.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

# Configuration of the "rpc" context for file
#rpc.class=org.apache.hadoop.metrics.file.FileContext
#rpc.period=10
#rpc.fileName=/tmp/rpcmetrics.log

# Configuration of the "rpc" context for ganglia
 rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 rpc.period=10
 rpc.servers=Master:8650
Restart the datanode, tasktracker with the command
$ sudo service hadoop-0.20-tasktracker restart

$ sudo service hadoop-0.20-datanode restart
Restart gmond
$ telnet Master 8649
will spit out the Hadoop metrics as XML for this host, Slave1

The same procedure applied to Slave1 must be followed for Slave2, followed by restarting the Hadoop services and gmond.

On the Master,
Restart gmond and gmetad with
$ sudo /etc/init.d/ganglia-monitor restart
$ sudo /etc/init.d/gmetad restart

Hit the web URL
http://Master/ganglia
Check for Metrics, Grid, Cluster and all the nodes you configured for.
You can also see which hosts are up, which are down, the total CPUs, and lots more!
 

Enjoy monitoring your cluster! :)

Sunday, 1 July 2012

Hadoop Hangover: Time To Yarn [!Yawn ;)] and Watchout for Kitten!

         
Well, we have seen new versions and new releases in software as days roll by. No wonder!
Same is the case with Hadoop! Past few months we saw new releases on the table.
I would like to recommend the following amazing article by Jim, explaining evolving Java-based APIs.
A good read and a MUST read!
http://wiki.eclipse.org/Evolving_Java-based_APIs

Releases?? Here it goes:
  • 1.x series is the continuation of 0.20 release series
  • 2.x series is the continuation of 0.23 release series

The 1.x series is the stable one, and the majority of the commercial distribution market carries it forward. This release is more mature in terms of authentication.
In security parlance, there are two major buzzwords:
  • authentication means verifying that the person performing an operation is the person they claim to be.
  • authorization means determining what a person can do with a particular file or directory.

Previously, only authorization was supported, leaving a wide-open path for malicious users to spoof identities and gain a free entry pass into the cluster! The 1.x series is now more solid in this regard, with the introduction of Kerberos, the open-source network authentication protocol, to authenticate users. There is now quite a good handshake between these two open source fellows.
Kerberos authenticates to Hadoop that the person is who he/she claims to be, and then it is Hadoop's job to do the authorization check on the permissions. It was in the 0.20.20x series that the whole Kerberos integration was implemented.
Be warned that it's not so easy with this whole lotta Kerberos thingy; too much stuff happening deep underneath! :)

The 2.x and 0.22 series are not as stable as 1.x! But they are good in their own ways, and they bring lots of new features.
1. MapReduce2: The classic runtime is replaced by a new runtime, MapReduce2, also known as YARN [Yet Another Resource Negotiator], which is basically a resource management system running across your cluster in the distributed environment.
2. High Availability: The namenode is no longer a SPOF (Single Point Of Failure). There are backup namenodes to avoid it, meaning a standby namenode is ready to take over the moment the active one fails.
3. HDFS Federation: The namenode is the main repository for a whole lot of metadata, and a cluster can have so many nodes, with equally huge storage, that this metadata (the information about each file and block) becomes a limiting factor in terms of memory! Thus, we can have multiple namenodes, each managing a part of the filesystem namespace.

MapReduce2 or YARN:
If you have a large cluster, it is obvious that you can process large data sets. However, there was an issue with the classic runtime in previous releases, wherein clusters of more than about 4000 machines hit a scalability wall, and running many small clusters instead is costly. Thus, we want a large number of machines in a single cluster. The new architecture tends to increase hardware utilization, supports paradigms other than MapReduce, and is more agile.
To YARN is a beauty ;)
Actually, this architectural design is now closer to the Google MapReduce paper. Reference: http://research.google.com/archive/mapreduce.html
It's an awesome next-generation Hadoop framework, and MapReduce is just one application on it. There are many such applications written upon YARN:
  • Open MPI
  • Spark
  • Apache Giraph
  • Apache Hama
  • Apache HBase [generic coprocessors] etc.

So it does make sense that it can handle other paradigms and also different versions of MapReduce. We all know how complex MapReduce can be, and solving every problem in terms of MR is an uphill task!

The classic runtime of previous release had a JobTracker, which had two important functions.
  • Resource Management
  • Job Scheduling and Job Monitoring

But this architecture is a bit of a redesign of the previous one, splitting the JobTracker into two major separate components.
  • Resource Manager: It computes the resources required by the applications globally and assigns/manages them.
  • Per-Application Application Master: The scheduling and monitoring done by JobTracker previously is done per application by the application master.

These two are separate daemons. What does "application" mean here?
An application is nothing but a job in the classic runtime.
The Application Master talks to the Resource Manager for resources in the cluster, also termed "containers" (I/O, CPU, memory etc.). A container is the home where all the application processes run, and it is the heavy duty of the node managers to ensure the applications do not rob more resources than their containers allow.
  • The MapReduce program will run the job at the client node in the client JVM.
  • The job gets a new application ID from the resource manager, copies the job resources to the filesystem shared by the cluster, and then submits the application back to the resource manager.
  • It is now the duty of the resource manager to start a container at a node manager, which launches the Application Master.
  • The Application Master initialises the job at the node manager and retrieves the input splits from the cluster's shared filesystem, then requests resources (containers) from the resource manager.
  • The Application Master starts containers at the node managers where the tasks are to run, in parallel.
  • The node manager starts the task JVM, where the YarnChild retrieves the resources it needs from the shared filesystem and runs the corresponding map or reduce task.


However, it's not as simple as listed above. There is a lot happening!
Well, this was just an introduction. Developers who want to play with YARN, do have a look at Kitten (released just 5 days before the day of writing this).
Link: https://github.com/cloudera/kitten. It is a set of Java libraries and Lua-based configuration files that handle configuring, launching, and monitoring YARN applications, allowing developers to spend most of their time writing the logic of an application, not its configuration.
Will see more of all these in the coming posts.
Until then, Happy Learning!! :):)

Wednesday, 27 June 2012

Hadoop Hangover: Introduction To Apache Bigtop and Installing Hive, HBase and Pig

In the previous post we learnt how easy it was to install Hadoop with Apache Bigtop!
We know it's not just Hadoop; there are sub-projects around the table! So, let's have a look at how to install Hive, HBase and Pig in this post.


Before rowing your boat...
Please follow the previous post and get ready with Hadoop installed!
Follow the link for previous post:
http://femgeekz.blogspot.in/2012/06/hadoop-hangover-introduction-to-apache.html
also, the same can be found at DZone, developer site: http://www.dzone.com/links/hadoop_hangover_introduction_to_apache_bigtop_and.html


All Set?? Great! Head On..
Make sure all the Hadoop services are running, namely JobTracker, SecondaryNameNode, TaskTracker, DataNode and NameNode. [single-node setup]


Hive with Bigtop:
The steps here are almost the same as installing Hive as a separate project.
However, a few steps are reduced.
The Hadoop installed in the previous post is Release 1.0.1


We had installed Hadoop with the following command
sudo apt-get install hadoop\*
Step 1: Installing Hive
We have installed Bigtop 0.3.0, so issuing the following command installs all the Hive components,
i.e. hive, hive-metastore and hive-server. The daemon names are different in Bigtop 0.3.0.
sudo apt-get install hive\*
This installs all the Hive components. After installing, the scripts must be able to create /tmp and /user/hive/warehouse in HDFS, but this doesn't happen during installation as the installer is unaware of the path to Java. So, create the directories if they were not created and grant the execute permissions.
In the hadoop directory, ie. /usr/lib/hadoop/
bin/hadoop fs -mkdir /tmp
bin/hadoop fs -mkdir /user/hive/warehouse

bin/hadoop fs -chmod g+x /tmp
bin/hadoop fs -chmod g+x /user/hive/warehouse



Step 2: The server also needs the directories /var/run/hive and /var/lock/subsys
sudo mkdir /var/run/hive
sudo mkdir /var/lock/subsys



Step 3: Start the hive server, a daemon
sudo /etc/init.d/hive-server start
Image: start hive-server

Step 4: Running Hive
Go to the directory /usr/lib/hive and start the Hive CLI:
bin/hive

Image: bin/hive

Step 5: Operations on Hive
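For reference, a basic Hive session looks something like this (the table name and file path are made-up examples, not taken from the screenshot):

hive> CREATE TABLE books (title STRING, author STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA LOCAL INPATH '/home/hduser/book.csv' INTO TABLE books;
hive> SELECT COUNT(*) FROM books;
hive> SHOW TABLES;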
Image: Basic hive operations

HBase with Bigtop:
Installing Hbase is similar to Hive.


Step 1: Installing HBase
sudo apt-get install hbase\*
Image: hbase-0.92.0

Step 2: Starting HMaster
sudo service hbase-master start
Image: Starting HMaster
Image: jps (HMaster started)

Step 3: Starting HBase shell
hbase shell
Image: start HBase shell

Step 4: HBase Operations
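For reference, the operations shown there look something like this (the table and column family names are made-up examples):

hbase(main):001:0> create 'books', 'info'
hbase(main):002:0> put 'books', 'row1', 'info:title', 'Hadoop in Action'
hbase(main):003:0> get 'books', 'row1'
hbase(main):004:0> scan 'books'
hbase(main):005:0> describe 'books'
hbase(main):006:0> list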
Image: HBase table operations
Image: list, scan, get, describe in HBase

Pig with Bigtop:
Installing Pig is similar too.


Step 1: Installing Pig
sudo apt-get install pig
Image: Installing Pig

Step 2: Moving a file to HDFS
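The copy itself is a one-liner from the Hadoop directory (the local and HDFS paths here are assumptions, not taken from the screenshot):

bin/hadoop fs -put book.csv /user/hduser/book.csv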
Image: Moving a tab separated file "book.csv" to HDFS

Step 3: Installed Pig-0.9.2
Image: Pig installed (Pig-0.9.2)

Step 4: Starting the grunt shell
pig
Image: Starting Pig

Step 5: Pig Basic Operations
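As a small illustration of such a session (the file path and field names are assumptions, not taken from the screenshots):

grunt> books = LOAD '/user/hduser/book.csv' USING PigStorage('\t') AS (title:chararray, author:chararray);
grunt> titles = FOREACH books GENERATE title;
grunt> DUMP titles;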
Image: Basic Pig Operations
Image: Job Completion

We saw that it is possible to install the sub-projects and work with Hadoop with no issues.
Apache Bigtop has its own spark! :)
There is an upcoming release, Bigtop 0.4.0, which is supposed to fix the following issues:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12318889&styleName=Html&projectId=12311420
Source and binary files:
http://people.apache.org/~rvs/bigtop-0.4.0-incubating-RC0
Maven staging repo:
https://repository.apache.org/content/repositories/orgapachebigtop-279
Bigtop's KEYS file containing PGP keys we use to sign the release:
http://svn.apache.org/repos/asf/incubator/bigtop/dist/KEYS


Let us see how to install other sub-projects in the coming posts!
Until then, Happy Learning!! :):)