Tuesday, 18 June 2013

One Cap to rule 'em all ...

Wondering which cap it could be?
....Well, it's Capistrano :D

I am a fan of Capistrano from way back and we use it for almost all kinds of deployments - Hadoop, MongoDB clusters and so on.
If you have not tried Capistrano, you must give it a try and figure out how you can use it for deployments in your environment.

It's highly configurable - so capify your stuff!

Check out the capified scripts to deploy a replicated, sharded MongoDB cluster on AWS EC2 instances at the following link
https://github.com/SwathiMystery/deploy_shard_mongodb
Feel free to experiment, report bugs/issues and contribute back.

For more details, follow the link below:
https://github.com/SwathiMystery/deploy_shard_mongodb/blob/master/README.md#deploy-replicated-sharded-mongodb-cluster
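If you just want a quick feel for the workflow, a session looks roughly like the sketch below (the task name in the last line is only a placeholder -- run cap -T to see the tasks the recipes actually define):
$ git clone https://github.com/SwathiMystery/deploy_shard_mongodb.git
$ cd deploy_shard_mongodb
$ cap -T            # list all Capistrano tasks defined by the recipes
$ cap <task-name>   # hypothetical -- pick a task from the cap -T output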

Monday, 15 April 2013

Monitoring S3 uploads for real-time data

        If you are working on Big Data and its bleeding-edge technologies like Hadoop, the primary thing you need is a "dataset" to work on. This data can be reviews, blogs, news, social media data (Twitter, Facebook etc.), domain-specific data, research data, forums, groups, feeds, firehose data and so on. Generally, companies reach out to data vendors to fetch such data.

        Normally, these data vendors dump the data onto a shared server of some kind. For us to use this data for processing with MapReduce and the like, we move it to S3, first for storage and next for processing. Assume the data belongs to social media such as Twitter or Facebook; then the data can be dumped into directories following a date format. In the majority of cases, this is the practice.
Also, assuming a stream of 140-150 GB/day being dumped into a hierarchy like 2013/04/15, i.e. the yyyy/mm/dd format, how do you
-  upload them to s3 in the same hierarchy to a given bucket?
-  monitor the new incoming files and upload them?
-  save the space effectively on the disk?
-  ensure the reliability of uploads to s3?
-  clean up the logs, if logging is enabled for tracking?
-  re-try the failed uploads?

These were some of the questions running at the back of my mind when I wanted to automate the uploads to S3. I also wanted zero human intervention, or at least as little as possible!
So, I came up with:
- s3sync / s3cmd
- the Python Watcher script by Greggory Hernandez, here: https://github.com/greggoryhz/Watcher
  A big thanks! This helped me with the monitoring part and it works great!
- a few of my own scripts

What are the ingredients?
  •  Installation of s3sync. I actually use just one script from the s3sync package, s3cmd, and not s3sync itself. Maybe I will in the future -- so I keep it installed.
  • Install Ruby from the repository
    $ sudo apt-get install ruby libopenssl-ruby
    Confirm with the version
    $ ruby -v
    Download and unzip s3sync
    $ wget http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
    $ tar -xvzf s3sync.tar.gz
    Install the certificates.
    $ sudo apt-get install ca-certificates
    Add your credentials to s3config.yml so that s3sync can connect to S3.
    $ cd s3sync/
    $ sudo vi s3config.yml
    aws_access_key_id: ABCDEFGHIJKLMNOPQRST
    aws_secret_access_key: hkajhsg/knscscns19mksnmcns
    ssl_cert_dir: /etc/ssl/certs
    Edit aws_access_key_id and aws_secret_access_key to your own credentials.
  • Installation of Watcher.
  • Go to https://github.com/greggoryhz/Watcher
    Copy https://github.com/greggoryhz/Watcher.git to your clipboard
    Install git if you have not already
    Clone Watcher:
    $ git clone https://github.com/greggoryhz/Watcher.git
    $ cd Watcher/
  • My own wrapper scripts.
  • cron
Next, with the environment set up, let's make some common "assumptions".
  • Data will be dumped at /home/ubuntu/data/ -- below that it could be 2013/04/15, for example.
  • s3sync is located at /home/ubuntu
  • Watcher repository is at /home/ubuntu
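As a quick illustration of the path mapping that the scripts below rely on (the file name here is just a placeholder; "bucket-data" is the bucket used in the scripts), the watch-path prefix gets stripped so the S3 key keeps the yyyy/mm/dd hierarchy:
    # Prefix trimming as used in monitor.sh further below
    watchPath='/home/ubuntu/data'
    s3path='bucket-data:'
    PASSED='/home/ubuntu/data/2013/04/15/tweets_001.json'   # hypothetical incoming file
    out=${PASSED#$watchPath}   # -> /2013/04/15/tweets_001.json
    outPath=${out#"/"}         # -> 2013/04/15/tweets_001.json
    echo "$s3path$outPath"     # -> bucket-data:2013/04/15/tweets_001.json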
Getting our hands dirty...
  • Go to Watcher and set the directory to be watched and the corresponding action to be taken.
  • $ cd Watcher/
    Start the script,
    $ sudo python watcher.py start
    This will create a .watcher directory at /home/ubuntu
    Now,
    $ sudo python watcher.py stop
    Go to the .watcher directory created and
    set the directory to be watched and the action to be taken
    in jobs.yml, i.e. watch: and command:
    # Copyright (c) 2010 Greggory Hernandez
    # Permission is hereby granted, free of charge, to any person obtaining a copy
    # of this software and associated documentation files (the "Software"), to deal
    # in the Software without restriction, including without limitation the rights
    # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    # copies of the Software, and to permit persons to whom the Software is
    # furnished to do so, subject to the following conditions:
    # The above copyright notice and this permission notice shall be included in
    # all copies or substantial portions of the Software.
    # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
    # THE SOFTWARE.
    # ---------------------------END COPYRIGHT--------------------------------------
    # This is a sample jobs file. Yours should go in ~/.watcher/jobs.yml
    # if you run watcher.py start, this file and folder will be created
    job1:
      # a generic label for a job. Currently not used make it whatever you want
      label: Watch /home/ubuntu/data for added or removed files
      # directory or file to watch. Probably should be abs path.
      watch: /home/ubuntu/data
      # list of events to watch for.
      # supported events:
      # 'access' - File was accessed (read) (*)
      # 'atrribute_change' - Metadata changed (permissions, timestamps, extended attributes, etc.) (*)
      # 'write_close' - File opened for writing was closed (*)
      # 'nowrite_close' - File not opened for writing was closed (*)
      # 'create' - File/directory created in watched directory (*)
      # 'delete' - File/directory deleted from watched directory (*)
      # 'self_delete' - Watched file/directory was itself deleted
      # 'modify' - File was modified (*)
      # 'self_move' - Watched file/directory was itself moved
      # 'move_from' - File moved out of watched directory (*)
      # 'move_to' - File moved into watched directory (*)
      # 'open' - File was opened (*)
      # 'all' - Any of the above events are fired
      # 'move' - A combination of 'move_from' and 'move_to'
      # 'close' - A combination of 'write_close' and 'nowrite_close'
      #
      # When monitoring a directory, the events marked with an asterisk (*) above
      # can occur for files in the directory, in which case the name field in the
      # returned event data identifies the name of the file within the directory.
      events: ['create', 'move_from', 'move_to']
      # TODO:
      # this currently isn't implemented, but this is where support will be added for:
      # IN_DONT_FOLLOW, IN_ONESHOT, IN_ONLYDIR and IN_NO_LOOP
      # There will be further documentation on these once they are implmented
      options: []
      # if true, watcher will monitor directories recursively for changes
      recursive: true
      # the command to run. Can be any command. It's run as whatever user started watcher.
      # The following wildards may be used inside command specification:
      # $$ dollar sign
      # $watched watched filesystem path (see above)
      # $filename event-related file name
      # $tflags event flags (textually)
      # $nflags event flags (numerically)
      # $dest_file this will manage recursion better if included as the dest (especially when copying or similar)
      # if $dest_file was left out of the command below, Watcher won't properly
      # handle newly created directories when watching recursively. It's fine
      # to leave out when recursive is false or you won't be creating new
      # directories.
      # $src_path is only used in move_to and is the corresponding path from move_from
      # $src_rel_path [needs doc]
      command: sudo sh /home/ubuntu/s3sync/monitor.sh $filename
  • Create a script called monitor.sh in the s3sync directory to upload to S3, as below.
    • The variable you may want to change is the S3 bucket path, "s3path", in monitor.sh.
    • This script uploads each new incoming file detected by the watcher script using the Reduced Redundancy Storage (RRS) class. (You can drop that header if you are not interested in storing in RRS format.)
    • The script calls the s3cmd Ruby script to upload recursively and thus maintains the hierarchy, i.e. the yyyy/mm/dd format, with files *.*
    • It deletes a file from the local path once it has been uploaded to S3 successfully -- to save disk space.
    • The script does not delete directories; that is taken care of by yet another script, re-upload.sh, which also acts as a backup so that failed uploads get uploaded to S3 again.
    Go to the s3sync directory
    $ cd ~/s3sync
    $ sudo vim monitor.sh
    #!/bin/bash
    ##...........................................................##
    ## script to upload to S3BUCKET, once the change is detected ##
    ##...........................................................##
    ## AWS credentials required for s3sync ##
    export AWS_ACCESS_KEY_ID=ABCDEFGHSGJBKHKDAKS
    export AWS_SECRET_ACCESS_KEY=jhhvftGFHVgs/bagFVAdbsga+vtpmefLOd
    export SSL_CERT_DIR=/etc/ssl/certs
    #echo "Running monitor.sh!"
    echo "[INFO] File or directory modified = $1 "
    ## Read arguments
    PASSED=$1
    # Declare the watch path and S3 destination path
    watchPath='/home/ubuntu/data'
    s3path='bucket-data:'
    # Trim the watch path from PASSED to build the S3 key
    out=${PASSED#$watchPath}
    outPath=${out#"/"}
    echo "[INFO] ${PASSED} will be uploaded to the S3 path: $s3path$outPath"
    if [ -d "${PASSED}" ]; then
        # Directories themselves are never uploaded or deleted (re-upload.sh handles them)
        echo "[SAFEMODE ON] Directory created will not be uploaded, unless a file exists!"
        exit 0
    elif [ -f "${PASSED}" ]; then
        # Upload the file via s3cmd.rb, preserving the yyyy/mm/dd hierarchy, in RRS storage class
        ruby /home/ubuntu/s3sync/s3cmd.rb --ssl put $s3path$outPath ${PASSED} x-amz-storage-class:REDUCED_REDUNDANCY
        RETVAL=$?
    else
        echo "[ERROR] ${PASSED} is not a valid type!!"
        exit 1
    fi
    # Delete the local file only if the upload succeeded -- to save disk space
    [ $RETVAL -eq 0 ] && echo "[SUCCESS] Upload successful!" &&
        sudo rm -f ${PASSED} && echo "[SUCCESS] Sync and deletion successful!"
    [ $RETVAL -ne 0 ] && echo "[ERROR] Synchronization failed!!"
  • Create a script called re-upload.sh which will retry the failed uploads.
    • This script ensures that the files left over by monitor.sh (failed uploads -- the chance of this is very low, maybe 2-4 files/day, due to various reasons) are uploaded to S3 again, with the same hierarchy, in RRS format.
    • After a successful upload, it deletes the file, and then the directory if it is empty.
    Go to the s3sync directory.
    $ cd s3sync
    $ sudo vim re-upload.sh
    #!/bin/bash
    ##.........................................................##
    ## script to detect failed uploads of other date directories
    ## and re-try                                              ##
    ##.........................................................##
    ## AWS credentials required for s3sync ##
    export AWS_ACCESS_KEY_ID=ABHJGDVABU5236DVBJD
    export AWS_SECRET_ACCESS_KEY=hgvgvjhgGYTfs/I5sdn+fsbfsgLKjs
    export SSL_CERT_DIR=/etc/ssl/certs
    # Get the previous day's date parts (yyyy, mm, dd)
    year=$(date -d "1 days ago" +%Y)
    month=$(date -d "1 days ago" +%m)
    yday=$(date -d "1 days ago" +%d)
    # Set the path of the data
    basePath="/home/ubuntu/data"
    datePath="$year/$month/$yday"
    fullPath="$basePath/$datePath"
    echo "Path checked for: $fullPath"
    # Declare the watch path and S3 destination path
    watchPath='/home/ubuntu/data'
    s3path='bucket-data:'
    # Check for left-over files (failed uploads)
    if [ "$(ls -A $fullPath)" ]; then
        for i in `ls -a $fullPath/*.*`; do
            echo "Left over file: $i"
            if [ -f "$i" ]; then
                out=${i#$watchPath}
                outPath=${out#"/"}
                echo "Uploading to $s3path$outPath"
                ruby /home/ubuntu/s3sync/s3cmd.rb --ssl put $s3path$outPath $i x-amz-storage-class:REDUCED_REDUNDANCY
                RETVAL=$?
                [ $RETVAL -eq 0 ] && echo "[SUCCESS] Upload successful!" &&
                    sudo rm -f $i && echo "[SUCCESS] Deletion successful!"
                [ $RETVAL -ne 0 ] && echo "[ERROR] Upload failed!!"
            else
                echo "[CLEAN] no files exist!!"
                exit 1
            fi
        done
    else
        echo "$fullPath is empty"
        sudo rm -rf $fullPath
        echo "Successfully deleted $fullPath"
        exit 0
    fi
    # After the re-uploads, delete the date directory if it is now empty
    if [ "$(ls -A $fullPath)" ]; then
        echo "Man!! Somethingz FISHY! All (failed) uploads should have been deleted. Are there files left!??"
        echo "Man!! I cannot delete it then! Please go check $fullPath"
    else
        echo "$fullPath is empty after uploads"
        sudo rm -rf $fullPath
        echo "Successfully deleted $fullPath"
    fi
  • Now, the dirtier work -- logging and cleaning logs.
    • All the "echo" output from monitor.sh can be found in ~/.watcher/watcher.log while watcher.py is running.
    • This log helps us initially, and maybe later too, to backtrack errors and so on.
    • Call of duty - a janitor for cleaning the logs. For this, we can use cron to run a script periodically. I wanted it to run every Saturday at 8.00 AM.
    • Create a script to clean the log, called "clean_log.sh", in /home/ubuntu/s3sync -- a minimal sketch is shown below.
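    (A minimal sketch of what clean_log.sh could look like -- the log path and the truncate-in-place approach are my assumptions; adjust them to your setup.)
    #!/bin/bash
    ##................................................##
    ## clean_log.sh -- truncate the watcher log       ##
    ##................................................##
    LOG_FILE=/home/ubuntu/.watcher/watcher.log
    if [ -f "$LOG_FILE" ]; then
        # Empty the log in place so the running watcher.py keeps writing to the same file
        cat /dev/null > "$LOG_FILE"
        echo "[SUCCESS] Truncated $LOG_FILE"
    else
        echo "[WARN] $LOG_FILE not found; nothing to clean"
    fi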
  • Time for cron
  • $ crontab -e
    Add the following lines at the end and save.
    # EVERY SATURDAY 8:00AM clean watcher log
    0 8 * * 6 sudo sh /home/ubuntu/s3sync/clean_log.sh
    # EVERYDAY at 10:00AM check failed uploads of previous day
    0 10 * * * sudo sh /home/ubuntu/s3sync/re-upload.sh
    • All set! Log cleaning happens every Saturday at 8.00 AM, and the re-upload script runs for the previous day to check whether files still exist and clean up accordingly.
  • Let's start the script
  • Go to the Watcher repository
    $ cd ~/Watcher
    $ sudo python watcher.py start
    This will create the ~/.watcher directory, with watcher.log in it, when started.
So, this ensures successful uploads to S3.
My bash-fu with truth! ;)
Happy Learning! :)

Tuesday, 19 March 2013

Phabricator - pretty()

Ah, pretty() -- liking it too much! These days I'm into MongoDB, so.. :P
It's like 3am-ish now and I know I'm blabbering... Gotta sleep!
Ah! Please bear with the rest of my ramblings!

Last month, I was looking into some of the code review tools!
You know - reviewing the review tools! ;)

Well, I experimented with Review Board, Gerrit and Barkeep.
Gerrit does not support post-commit review, and I was looking for a way to conduct code review after the commit is pushed to git.
Review Board is cool. It has been around for a long time, and I experimented with it a lot on the cloud. But the post-review process was a little painful on the command line.
Barkeep's installation is kind of different. I did not invest much time in it, but it looks interesting!

I went with Phabricator, released by Facebook and open source too.
Wanna contribute? Fork it here: https://github.com/facebook/phabricator/ ; otherwise, go to http://phabricator.org/
Okay, I don't wanna geeko-fy more with operators ;)

Well, Phabricator is used by many companies like Quora (unsure if they are presently using it or not, I have a doubt, as I read in some Quora thread), Dropbox, Path (the Android app), Disqus (online commenting system), deviantArt (the online art community)... Ah well, lots more!!


  • What's cool about Phabricator?

It has a complete suite of software tools. Okay!! You want ...

    • Code Review ? Differential
    • Post Commit Review ? Audits
    • Code Browser ? Diffusion
    • Bug Tracking ? Maniphest
    • Wiki ? Phriction
    • Want more ? It's under active development... so you're gonna get more too!
Well, personally, I prefer using Audits, Diffusion and Maniphest.
Wiki - we already use Confluence/Bitbucket for that, so it's not essential for now.
Code review -- a pre-commit review process stalls development in an agile kind of environment, so I'm not going with it.

  • How's the installation?
I read a question on Quora asking why it is difficult to install Phabricator. I believe it's not that difficult.
Maybe I will post how to install/configure everything with Phabricator.
Please bear with the mentions of Quora, I'm a quor(a)ddict! :D

  • How to start with Audits?
Getting used to the workflow does take time, probably because it's still under development and the documentation demands an update. But the greatest thing I love about Phabricator is that, even though it is still under development, it functions as it purports to!! That's amazing! The UI is lovely too!

I faced this issue -- if you have a huge repository where code review has never happened before, how do you start the process with the previous commits?
Unfortunately, you cannot do it with existing commits! Audits are only possible for commits that happen after Phabricator has been installed and configured with that repository.
You will have to create some rules that trigger post-commit audits. These are called Herald rules in Phabricator.

I found a workaround for the old commits -- if you wanna review them -- you can browse the repository with Diffusion, go to the module/code you are interested in and click on the commit number.
Upon clicking the commit number, you have the freedom to post in-line comments/review comments, and the diff is visible there too.
After finishing your commenting process, you can wrap it up with Raise Concern. I also prefer mapping this to a task tracker -- I mean the bug tracker (Maniphest) -- add the assignee, link to the comments you gave in the description and assign it.
This will send a mail to the team.

Email configuration is one big task.
I used Postfix with SMTP for outbound mail and have updated this in the Phabricator project on GitHub.

Mapping Your Projects/Audits/Team
This takes a lot of time and effort too.
  • First, you need to create accounts for everyone.
  • You will have to configure your remote repository (using git).
  • Map the modules to the module owners and the team lead.
  • Create a Herald rule which triggers audits.

So, you can set it all up completely in 2 days. But it's fun!! Audits can be real fun..!! :)
I would really recommend Phabricator! :D
Happy Tweaking! :)

Wednesday, 20 February 2013

Books that every developer must read!

        I have a rack-space of books! I just wanted to make a wishlist of all the books which I've read and am yet to read! :)

1.
Design Patterns: Elements of Reusable Object-Oriented Software
Erich Gamma, Richard Helm, Ralph Johnson, John Vlissides

2.
Structure And Interpretation Of Computer Programs, Second Edition : By Harold Abelson and Gerald Jay Sussman

3.
Refactoring to Patterns
Joshua Kerievsky

4.
Types and Programming Languages
Benjamin C. Pierce

5.
Code: The Hidden Language of Computer Hardware and Software
Charles Petzold

6.
Object-Oriented Analysis and Design with Applications (2nd Edition)
Grady Booch

7.
Code Complete: A Practical Handbook of Software Construction, Second Edition
Steve McConnell

8.
The Design of the UNIX Operating System [Prentice-Hall Software Series]
Maurice J. Bach

9.
The Pragmatic Programmer: From Journeyman to Master
Andrew Hunt, David Thomas

10.
Practical API Design: Confessions of a Java Framework Architect
Jaroslav Tulach

11.
The Practice of Programming (Addison-Wesley Professional Computing Series)
Brian W. Kernighan, Rob Pike


12.
Programming Pearls (2nd Edition)
Jon Bentley

13.
Writing Secure Code, Second Edition
Michael Howard, David LeBlanc

14.
The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition (2nd Edition)
Frederick P. Brooks Jr.

15.
Patterns of Enterprise Application Architecture
Martin Fowler

16.
Introduction to Functional Programming (Prentice Hall International Series in Computing Science)
Richard Bird

17.
The Art of Computer Programming
Donald E. Knuth

18.
Effective Java (2nd Edition)
Joshua Bloch


19.
Thinking in Java (4th Edition) 
Bruce Eckel

20.
Programmers at Work: Interviews With 19 Programmers Who Shaped the Computer Industry (Tempus)
Susan Lammers


21.
Coders at Work: Reflections on the Craft of Programming
Peter Seibel

Well, I compiled all of the above from Amazon. I will keep appending to the list as I remember more :)
Happy Learning! :)

Monday, 11 February 2013

Hadoop Hangover: How to launch a CDH4 Hadoop cluster [MRv1 / YARN + Ganglia] using Apache Whirr


  This post is about how to launch a CDH4 MRv1 or CDH4 YARN cluster on EC2 instances. It's said that you can launch a cluster with the help of Whirr in a matter of 5 minutes! This is very true, if and only if everything works out well! ;)

Hopefully, this article helps you in that regard.
So, let's row the boat...
  • Download the stable version of Apache Whirr, i.e. whirr-0.8.1.tar.gz, from the following link: whirr-0.8.1.tar.gz
  • Extract the tarball
  • $ tar -xzvf whirr-0.8.1.tar.gz
    $ cd whirr-0.8.1
  • Generate the key
  • $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
    $ cd whirr-0.8.1
  • Make a properties file to launch the cluster with that configuration.
  • # Cluster name goes here
    whirr.cluster-name=testcluster
    # Change the number of machines in the cluster here
    # Using 3 DN and TT and 1 JT and NN
    # Ganglia is configured
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode+ganglia-monitor+ganglia-metad,3 hadoop-datanode+hadoop-tasktracker+ganglia-monitor
    # Install JAVA
    whirr.java.install-function=install_openjdk
    whirr.java.install-function=install_oab_java
    ## Install CDH4 MRV1
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    whirr.env.REPO=cdh4
    # For EC2 set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
    whirr.provider=aws-ec2
    whirr.hardware-id=c1.xlarge
    # Credentials should go here
    whirr.identity=XXXXXXXXXXXXXXXXX
    whirr.credential=XXXXXXXXXXXXXXXXXXXX
    whirr.cluster-user=whirr
    whirr.private-key-file=/home/ubuntu/.ssh/yourKey
    whirr.public-key-file=/home/ubuntu/.ssh/yourKey.pub
  • Now let me tell you how to avoid getting headaches!
    • cluster name: Keep your cluster name simple. Avoid testCluster, testCluster1 etc., i.e. no capital letters or numerals.
    • Decide on the number of datanodes you want judiciously.
    • Your launch may not be successful if Java is not installed. Make sure the image has Java; however, this properties file takes care of that.
    • It will be good to go with MRv1 for now and switch to MRv2 later, when we get a production-stable release.
    • This is the minimal set of configurations for launching a Hadoop cluster, but you can do a lot of performance tuning on top of it.
    • I launched this cluster from an EC2 instance. Initially I faced errors regarding the user; setting the configuration below solved the problem.
    • whirr.cluster-user=whirr
    • Set proper permissions for ~/.ssh and the whirr-0.8.1 folder before launching, for example:
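      (A minimal example -- assuming whirr-0.8.1 was extracted in your home directory; the exact modes are just a suggestion.)
      $ chmod 700 ~/.ssh
      $ chmod 600 ~/.ssh/id_rsa_whirr ~/.ssh/yourKey
      $ chmod -R 755 ~/whirr-0.8.1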
  •  Well, we are ready to launch the cluster. Name the properties file as "whirr_cdh.properties".
  • $ cd whirr-0.8.1
    $ bin/whirr launch-cluster --config whirr_cdh.properties
In the console you can see links to the NameNode and JobTracker web UIs. At the end it also prints how to ssh into the instances.
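For example, logging in to one of the nodes looks roughly like this (the hostname below is only a placeholder -- use the exact command Whirr prints for your cluster):
    $ ssh -i /home/ubuntu/.ssh/yourKey whirr@ec2-xx-xx-xxx-xx.compute-1.amazonaws.com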

  • Now, the client-side files should have been generated under ~/.whirr/testcluster. You will be able to see these files: instances, hadoop-proxy.sh and hadoop-site.xml
  • Starting the proxy
  • $ sh hadoop-proxy.sh
  • Open another terminal, and type
  • You should now be able to access HDFS. (Note that HADOOP_CONF_DIR points to the directory containing the generated hadoop-site.xml.)
  • $ export HADOOP_CONF_DIR=~/.whirr/testcluster/
    $ hadoop fs -ls /
  • You can alternatively download a Hadoop tarball and launch with
  • $ bin/hadoop --config ~/.whirr/testcluster fs -ls /
  •  Okay! So I know that you will not be satisfied unless you see a web UI
  • Now, launch Firefox (v3.0+)
    Download the FoxyProxy extension by clicking this link: https://addons.mozilla.org/en-US/firefox/addon/2464.
    Steps to configure and access the UI
    Select Tools > FoxyProxy > Options
    Click the “Add New Proxy” button.
    Select “Manual Proxy Configuration”
    Enter “localhost” for the “Host or IP Address” field.
    Enter “6666” for the “Port” field.
    Click on the “General” tab at the top of the dialog box.
    Enter “EC2” for the “Proxy Name” field.
    Click on the “URL Patterns” tab at the top of the dialog box.
    Click the “Add New Pattern” button.
    Enter “EC2” for the “Pattern Name” field.
    Enter “*compute-1.amazonaws.com*, *.ec2.internal*, *.compute-1.internal*” for the “URL pattern” field (not case sensitive)
    Select the “Whitelist” and “Wildcards” radio buttons.
    Click the “OK” button to dismiss the new URL pattern dialog box.
    Click the “OK” button to dismiss the new proxy dialog box.
    Completely disable FoxyProxy for now.
    You should be able to see 2 proxy names after closing, default and EC2.
    Click on “Use proxy EC2 for all URLs” from the pop-up menu of FoxyProxy
    Copy the URL of the JobTracker (it can be seen while the proxy is running, ec2-***-**-***-**.********.amazonaws.com) and paste it in the browser.
So, we are good to go! 
  •   If you want to launch MRv2,  use this.
  • ## Cluster name goes here.
    whirr.cluster-name=yarncluster
    # Change the number of machines in the cluster here
    whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,2 hadoop-datanode+yarn-nodemanager
    # Install JAVA
    whirr.java.install-function=install_openjdk
    whirr.java.install-function=install_oab_java
    ## Install CDH4 Yarn
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    whirr.yarn.configure-function=configure_cdh_yarn
    whirr.yarn.start-function=start_cdh_yarn
    whirr.mr_jobhistory.start-function=start_cdh_mr_jobhistory
    whirr.env.REPO=cdh4
    whirr.env.MAPREDUCE_VERSION=2
    # For EC2 set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
    whirr.provider=aws-ec2
    whirr.hardware-id=c1.xlarge
    # Credentials should go here
    whirr.identity=XXXXXXXXXXXXXXXXX
    whirr.credential=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    whirr.cluster-user=whirr
    whirr.private-key-file=/home/ubuntu/.ssh/yourKey
    whirr.public-key-file=/home/ubuntu/.ssh/yourKey.pub
and the same process applies!
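When you are done experimenting, you can tear the whole cluster down with Whirr (assuming the same properties file that was used for the launch):
$ cd whirr-0.8.1
$ bin/whirr destroy-cluster --config whirr_cdh.properties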
Happy Learning! :)