Showing posts with label s3cmd. Show all posts

Monday, 15 April 2013

Monitoring S3 uploads for real-time data

        If you are working on Big Data and its bleeding-edge technologies like Hadoop, the first thing you need is a "dataset" to work on. This data can be reviews, blogs, news, social media data (Twitter, Facebook, etc.), domain-specific data, research data, forums, groups, feeds, firehose data, and so on. Generally, companies reach out to data vendors to fetch such data.

        Normally, these data vendors dump the data into something like a shared server. To use this data for processing with MapReduce and the like, we move it to S3 -- first for storage, then for processing. Assume the data belongs to social media such as Twitter or Facebook; it is then usually dumped into date-formatted directories, which is the common practice.
Also assuming a stream of 140-150 GB/day being dumped into a hierarchy like 2013/04/15, i.e. the yyyy/mm/dd format, how do you
-  upload them to s3 in the same hierarchy to a given bucket?
-  monitor the new incoming files and upload them?
-  save the space effectively on the disk?
-  ensure the reliability of uploads to s3?
-  clean up the logs, if logging is enabled for tracking?
-  re-try the failed uploads?

These were some of the questions running at the back of my mind when I wanted to automate the uploads to S3. I also wanted zero human intervention, or at least as little as possible!
So, I came up with:
- s3sync / s3cmd.
- the Python Watcher script by Greggory Hernandez, here: https://github.com/greggoryhz/Watcher
A big thanks! This helped me with the monitoring part, and it works great!
- a few of my own scripts.

What are the ingredients?
  •  Installation of s3sync. I have only used one script from s3cmd here, not s3sync itself. Maybe in the future -- so I keep it around.
  • Installation of Watcher.
  • My own wrapper scripts.
  • cron
Next, having set up the environment, let's make some common "assumptions".
  • Data will be dumped at /home/ubuntu/data/ -- under which it could be, for example, 2013/04/15.
  • s3sync is located at /home/ubuntu
  • Watcher repository is at /home/ubuntu
Getting our hands dirty...
  • Go to Watcher and set the directory to be watched and the corresponding action to be taken.
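For illustration, a watch job in my copy of Watcher looked roughly like the snippet below in its jobs.yml. The field names and the $filename placeholder are from memory of the version I used, so check the repository's README for the exact schema:

```yaml
job1:
  label: s3-data-watch
  watch: /home/ubuntu/data
  events: ['create']
  recursive: true
  command: /home/ubuntu/s3sync/monitor.sh $filename
```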
  • Create a script called monitor.sh to upload to s3 in s3sync directory as below.
    • The variable you may want to change is the s3 bucket path, "s3path", in monitor.sh.
    • This script uploads each new incoming file detected by the Watcher script in the Reduced Redundancy Storage format. (You can remove the header if you do not want to store in RRS format.)
    • The script calls the s3cmd ruby script to upload recursively and thus maintains the hierarchy, i.e. the yyyy/mm/dd format, with files *.*
    • It deletes each file successfully uploaded to s3 from the local path -- to save disk space.
    • The script does not delete the directory, as that is taken care of by yet another script, re-upload.sh, which acts as a backup for failed uploads to be uploaded again to s3.
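A minimal sketch of what such a monitor.sh could look like is below. The bucket name is a placeholder, and the bucket:key form plus the x-amz-storage-class header reflect the ruby s3cmd from s3sync as I remember it -- check your copy's usage before relying on this:

```shell
#!/bin/bash
# monitor.sh -- invoked by Watcher with the path of a newly created file.
# DATA_DIR, S3_BUCKET and S3CMD are assumptions: adjust to your layout.
DATA_DIR="/home/ubuntu/data"
S3_BUCKET="your-bucket"                 # placeholder bucket name
S3CMD="/home/ubuntu/s3sync/s3cmd.rb"    # the ruby s3cmd from s3sync

# Derive the S3 key from the local path, keeping the yyyy/mm/dd hierarchy.
s3path() {
    echo "${1#"$DATA_DIR"/}"
}

upload() {
    file="$1"
    key="$(s3path "$file")"
    echo "$(date) uploading $file -> $S3_BUCKET:$key"
    # The header asks S3 to store the object in Reduced Redundancy Storage;
    # drop it if you do not want RRS.
    if "$S3CMD" put "$S3_BUCKET:$key" "$file" \
            x-amz-storage-class:REDUCED_REDUNDANCY; then
        rm -f "$file"   # reclaim local disk space on success
        echo "$(date) uploaded and removed $file"
    else
        echo "$(date) FAILED $file -- left on disk for re-upload.sh"
    fi
}

# Watcher passes the new file's path as the first argument.
if [ $# -ge 1 ]; then
    upload "$1"
fi
```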
  • Create a script called re-upload.sh which uploads the failed file uploads.
    • This script ensures that the files left over by monitor.sh (failed uploads -- the chance of this is very low, maybe 2-4 files/day, due to various reasons) are uploaded to s3 again, with the same hierarchy, in RRS format.
    • After a successful upload, it deletes the file, and then the directory if it is empty.
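A sketch of re-upload.sh under the same assumptions (placeholder bucket, ruby s3cmd invocation, GNU date for "yesterday"):

```shell
#!/bin/bash
# re-upload.sh -- cron-driven backup for files monitor.sh failed to upload.
# Paths and the s3cmd invocation are assumptions: adjust to your setup.
DATA_DIR="/home/ubuntu/data"
S3_BUCKET="your-bucket"                 # placeholder bucket name
S3CMD="/home/ubuntu/s3sync/s3cmd.rb"

# Yesterday's dump directory in the yyyy/mm/dd hierarchy (GNU date).
day_dir="$DATA_DIR/$(date -d yesterday +%Y/%m/%d)"

if [ -d "$day_dir" ]; then
    # Retry every file still on disk, preserving the hierarchy and RRS.
    find "$day_dir" -type f | while read -r file; do
        key="${file#"$DATA_DIR"/}"
        "$S3CMD" put "$S3_BUCKET:$key" "$file" \
            x-amz-storage-class:REDUCED_REDUNDANCY && rm -f "$file"
    done
    # Remove the date directories once they are empty, deepest first.
    find "$day_dir" -depth -type d -empty -delete
fi
```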
  • Now, the dirtier work -- logging and cleaning logs.
    • All the "echo" output from monitor.sh can be found in ~/.watcher/watcher.log while watcher.py is running.
    • This log helps us, initially and perhaps later too, to backtrack errors and the like.
    • Call of duty -- a janitor for cleaning the logs. For this, we can use cron to run a script at a fixed time. I chose every Saturday at 8.00 AM.
    • Create a script to clean log as "clean_log.sh" in /home/ubuntu/s3sync
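The log janitor can be as small as this; the log path is the one mentioned above, and truncating in place (rather than deleting) is my choice so the running watcher.py keeps a valid file handle:

```shell
#!/bin/bash
# clean_log.sh -- truncate the Watcher log (path as described above).
LOG="$HOME/.watcher/watcher.log"
if [ -f "$LOG" ]; then
    # Truncate in place so the running watcher.py keeps a valid file handle.
    : > "$LOG"
    echo "$(date) cleaned $LOG"
fi
```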
  • Time for cron
    • All set! Log cleaning happens every Saturday at 8.00 AM, and the re-upload script runs for the previous day, checking whether files exist and cleaning up accordingly.
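The two crontab entries (added via crontab -e) might look like this -- the Saturday 8.00 AM slot is the one described above, while the daily time for re-upload.sh is my own assumption:

```
# m  h  dom mon dow  command
  0  8  *   *   6    /home/ubuntu/s3sync/clean_log.sh
  0  6  *   *   *    /home/ubuntu/s3sync/re-upload.sh
```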
  • Let's start the script
So, this ensures successful uploads to S3.
My bash-fu with truth! ;)
Happy Learning! :)

Tuesday, 18 December 2012

FUSE on Amazon S3

FUSE: Filesystem in Userspace, hosted on SourceForge, a well-known open source project: http://fuse.sourceforge.net/
With s3fs, you can either put files into the S3 bucket directly or into the mount point; both will always share the same hierarchy and stay in sync. The best part is that any arbitrary program can simply point to this mount point and use plain file commands, rather than file-system-specific commands.

Here is a small documentation about how we can achieve this.

1. Check out the code from Google Code.
$ svn checkout http://s3fs.googlecode.com/svn/trunk/ s3fs

2. Switch to the working directory
$ cd s3fs
$ ls 
AUTHORS  autogen.sh  ChangeLog  configure.ac  COPYING  doc  INSTALL  Makefile.am  NEWS  README  src  test

3. Now the same old ritual of configure, make, and install.
The next command needs autoconf, so make sure you have it by running:
$ sudo apt-get install autoconf
$ autoreconf --install 
If this fails, it is quietly telling you that you lack some libraries. Time to get them installed...
$ sudo apt-get install build-essential libfuse-dev fuse-utils libcurl4-openssl-dev libxml2-dev mime-support

Getting back...
$ ./configure --prefix=/usr
$ make
$ sudo make install

4. Done with the Installation process.
Cross-check:
$ /usr/bin/s3fs  
s3fs: missing BUCKET argument
Usage: s3fs BUCKET:[PATH] MOUNTPOINT [OPTION]...

5. Add the following line to your ~/.bashrc file and source it.
export s3fs=/usr/bin/s3fs
$ source ~/.bashrc
$ s3fs
s3fs: missing BUCKET argument
Usage: s3fs BUCKET:[PATH] MOUNTPOINT [OPTION]...

6. Install s3cmd. Many of you must be using this tool to interact with s3.
$ sudo apt-get install s3cmd
$ s3cmd --configure
This will configure with the S3 account using Access and Secret Key.

Configuring FUSE
1. First, uncomment user_allow_other in fuse.conf so that other users can use the mount.
$ vi /etc/fuse.conf

2. Set the AccessKey:SecretKey, in that format, in the passwd-s3fs file:
$ sudo vi /etc/passwd-s3fs
$ sudo chmod 640 /etc/passwd-s3fs
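The file holds one credential per line: the access key and secret key joined by a colon, or, scoped to a single bucket, the bucket name prefixed in front. The keys below are the standard AWS documentation placeholders:

```
AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
s3dir-sync:AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```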
3. I created a bucket called "s3dir-sync" for this experiment.
$ s3cmd ls
2012-12-18 09:23  s3://s3dir-sync
4. Create a mount point where you want to dump/place the files and keep them in sync with the S3 bucket. Create it as the root user.
$ sudo mkdir -p /mnt/s3Sync
$ sudo chmod 777 /mnt/s3Sync

5. Mount with s3fs, as the root user.
$ sudo s3fs s3dir-sync -o default_acl=public-read -o allow_other /mnt/s3Sync/
Cross-check:
$ mount -l
s3fs on /mnt/s3Sync type fuse.s3fs (rw,nosuid,nodev,allow_other)
If you try mounting again, you will get the following warning:
mount: according to mtab, s3fs is already mounted on /mnt/s3Sync

6. I created a directory structure of
/mnt/s3Sync/2012/12/18
and a test file inside it:
$ more test.txt
This is a check file to sync with the s3dir-sync.
Blah..!

The same is synced in the bucket "s3dir-sync"
Cross-Check: 
$ s3cmd ls s3://s3dir-sync
DIR   s3://s3dir-sync/2012/
2012-12-18 09:57         0   s3://s3dir-sync/2012

Happy Learning! :)