Tuesday, April 9, 2013

Hadoop installation in AWS EC2

I had been using Cloudera's CDH VM image on my local machine to get hands-on experience with Hadoop. I wanted to try it on AWS and see how smooth the process is.

So I started following Cloudera's blog post on the subject. However, the post had a lot of issues, and things didn't work as outlined. I got a little more help from this blog.

Here's the brief setup once the instance has been created. As suggested in both posts, I used Whirr to install the cluster and avoid a manual setup.

Step 1: Get the latest whirr binary
wget http://apache.cs.utah.edu/whirr/whirr-0.8.1/whirr-0.8.1.tar.gz
tar -xvf whirr-0.8.1.tar.gz
Step 2: Set up the Whirr config file. You can copy the contents below and fill in your AWS Access Key ID and Secret Access Key.
vi hadoop.properties
________________________________________________________________________________________________________________________
whirr.cluster-name=<cluster_name_all_lowercase>
whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,2 hadoop-datanode+yarn-nodemanager
whirr.provider=aws-ec2
whirr.identity=<AWS Access Key ID>
whirr.credential=<AWS Secret Access Key>
whirr.private-key-file=/home/ubuntu/.ssh/id_rsa
whirr.public-key-file=/home/ubuntu/.ssh/id_rsa.pub
whirr.env.mapreduce_version=2
whirr.env.repo=cdh4
whirr.hadoop.install-function=install_cdh_hadoop
whirr.hadoop.configure-function=configure_cdh_hadoop
whirr.mr_jobhistory.start-function=start_cdh_mr_jobhistory
whirr.yarn.configure-function=configure_cdh_yarn
whirr.yarn.start-function=start_cdh_yarn
whirr.hardware-id=m1.large
whirr.image-id=eu-west-1/ami-81c5fdf5
whirr.location-id=eu-west-1
________________________________________________________________________________________________________________________
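It's easy to leave one of the <...> placeholders in the file by accident, and that only shows up once the launch fails. A minimal sketch of a pre-flight check, assuming the file layout shown above; check_whirr_props is a hypothetical helper name, not something Whirr ships, and the list of required keys is my own selection from the file.

```shell
# Hypothetical helper (not part of Whirr): report problems in a properties file.
# Returns 0 only when every listed key has a value and no <placeholder> remains.
check_whirr_props() {
    props="$1"
    status=0
    [ -f "$props" ] || { echo "missing file: $props" >&2; return 1; }
    for key in whirr.cluster-name whirr.provider whirr.identity \
               whirr.credential whirr.private-key-file whirr.public-key-file; do
        # Print everything after the first '=' on the line whose key matches exactly.
        value=$(awk -F= -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$props")
        case "$value" in
            "")          echo "not set: $key" >&2; status=1 ;;
            *'<'*'>'*)   echo "placeholder left in: $key" >&2; status=1 ;;
        esac
    done
    return $status
}
```

Running check_whirr_props hadoop.properties before launching the cluster prints one line per problem and returns non-zero if anything needs fixing.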
Step 3: Install Java
sudo apt-get update
sudo apt-get install openjdk-6-jre-headless
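Whirr runs on the JVM, so it's worth confirming that java actually ended up on the PATH before moving on. A small sketch; have_java is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: succeed only if a `java` executable is found on PATH.
have_java() {
    command -v java >/dev/null 2>&1
}

if have_java; then
    # `java -version` writes to stderr, so merge it into stdout to show the first line.
    java -version 2>&1 | head -n 1
else
    echo "java not found; install a JRE before launching Whirr" >&2
fi
```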
Step 4: Generate an SSH key pair. I just pressed Enter at the first prompt.
ssh-keygen -t rsa -P ''
_______
Enter file in which to save the key (/home/ubuntu/.ssh/id_rsa): <Press Enter>
Your identification has been saved in /home/ubuntu/.ssh/id_rsa.
Your public key has been saved in /home/ubuntu/.ssh/id_rsa.pub.
The key fingerprint is:
_______
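If you'd rather script this step, passing -f along with -P '' tells ssh-keygen where to write the key, so the file prompt doesn't appear. A sketch under that assumption; ensure_keypair is a hypothetical helper name, and it deliberately leaves an existing key alone instead of overwriting it.

```shell
# Hypothetical helper: make sure an RSA key pair exists at the given path.
ensure_keypair() {
    key="$1"
    if [ -f "$key" ] && [ -f "$key.pub" ]; then
        echo "key pair already present: $key"
        return 0
    fi
    if [ -f "$key" ]; then
        # Private key exists but the public half is missing: derive it with -y.
        ssh-keygen -y -f "$key" > "$key.pub"
        return $?
    fi
    mkdir -p "$(dirname "$key")"
    # -P '' sets an empty passphrase; -f names the output file.
    ssh-keygen -t rsa -P '' -f "$key"
}
```

Calling ensure_keypair /home/ubuntu/.ssh/id_rsa matches the paths used in hadoop.properties above.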
Step 5: Launch the cluster. Wait until you get the instructions to ssh to the nodes.
bin/whirr launch-cluster --config hadoop.properties
Step 6: SSH to the nodes.
ssh -i /home/ubuntu/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no ubuntu@xx.xxx.xx.xx
Step 7: Verify that the Hadoop installation works. We'll look at this in more detail later.