Tuesday, March 13, 2012

A Short Script for Logging into Interactive Elastic MapReduce Clusters

Elastic MapReduce is great, but the latencies can be painful.  For me, this is especially true in the early stages of developing a new job, when I need to move from code on my local machine to code running in the cloud.  The roughly 5 minutes between starting up a cluster and actually being able to log on to it is too long to sit there staring at a blank screen and too short to context switch to something useful.

My current solution is to allow myself to get distracted, but to drag myself back to my EMR session as soon as it's available.  Adding some simple polling plus a sticky Growl notification to my interactive-emr-startup script does the trick quite nicely:


#!/bin/bash


if [ -z "$1" ]; then
  echo "Please specify a job name"
  exit 1
fi


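# start the cluster, echoing the CLI output while saving a copy to parse
TMP_FILE=$(mktemp)
elastic-mapreduce \
  (... with all of my favorite options ...) \
| tee "${TMP_FILE}"


# the job flow ID is taken from the 4th field of the saved output
JOB_ID=$(awk '{print $4}' "${TMP_FILE}")
rm -f "${TMP_FILE}"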


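# poll until the cluster reaches the WAITING state
JOB_STATE=''
MASTER_HOSTNAME=''
while [ "${JOB_STATE}" != "WAITING" ]; do
  sleep 1
  echo -n .
  RESULT=$(elastic-mapreduce --list | grep "${JOB_ID}")
  JOB_STATE=$(echo "${RESULT}" | awk 'NR==1 {print $2}')
  MASTER_HOSTNAME=$(echo "${RESULT}" | awk 'NR==1 {print $3}')
  # bail out rather than spin forever if the cluster has died
  case "${JOB_STATE}" in
    FAILED|TERMINATED|COMPLETED)
      echo "Cluster ${JOB_ID} is ${JOB_STATE}; giving up"
      exit 1 ;;
  esac
done
echo
echo "Connecting to ${MASTER_HOSTNAME}..."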


growlnotify -n "EMR Interactive" -s -m "SSHing into ${MASTER_HOSTNAME}"


# log in as the hadoop user, forwarding local port 9100 to the master
ssh -i ~/.ssh/emr-keypair -l hadoop -L 9100:localhost:9100 "${MASTER_HOSTNAME}"
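
One possible hardening of the polling step is to cap the number of polls, so that a cluster which never reaches WAITING cannot hang the script indefinitely. The sketch below is not part of my actual script: the function names `list_field` and `wait_for_waiting` are made up for illustration, and it assumes the listing prints the job ID, state, and master hostname as the first three whitespace-separated fields, as the awk calls above imply.

```shell
#!/bin/bash

# Hypothetical helper: read a cluster listing on stdin and print field
# number $2 of the first line that mentions job ID $1.
list_field() {
  local job_id="$1" field="$2"
  awk -v id="${job_id}" -v n="${field}" 'index($0, id) { print $n; exit }'
}

# Hypothetical helper: poll until the job is WAITING.
#   $1 = job ID, $2 = maximum number of polls, $3 = seconds between polls,
#   remaining arguments = the command that prints the cluster listing.
# Returns 0 on WAITING, 2 on a terminal state, 1 if the polls run out.
wait_for_waiting() {
  local job_id="$1" max_tries="$2" interval="$3"
  shift 3
  local state tries=0
  while [ "${tries}" -lt "${max_tries}" ]; do
    state=$("$@" | list_field "${job_id}" 2)
    case "${state}" in
      WAITING) return 0 ;;
      FAILED|TERMINATED|COMPLETED) return 2 ;;
    esac
    tries=$((tries + 1))
    sleep "${interval}"
  done
  return 1
}
```

In the script above this could be called as `wait_for_waiting "${JOB_ID}" 120 5 elastic-mapreduce --list`, which would give up after roughly ten minutes; those numbers are arbitrary. Passing the listing command as arguments also makes it easy to try the loop against a canned listing without starting a real cluster.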


One of my personal productivity goals for the year is finding little places like this that I can optimize with a short script.  This particular one has rescued me from the clutches of HN more than once!
