Tuesday, March 13, 2012

A Short Script for Logging into Interactive Elastic MapReduce Clusters

Elastic MapReduce is great, but the latencies can be painful. For me, this is especially true when I'm in the early stages of developing a new job and need to make the transition from code on my local machine to code running in the cloud -- the ~5 minute period between starting up a cluster and actually being able to log on to it is too long to sit there staring at a blank screen and too short to context switch to something else effectively.

My current solution is to allow myself to get distracted but to drag myself back to my EMR session as soon as it's available.  Adding some simple polling plus a sticky growl notification to my interactive-emr-startup script does the trick quite nicely:


#!/bin/bash

if [ -z "$1" ]; then
  echo "Please specify a job name"
  exit 1
fi

# capture the creation output so we can pull out the job flow ID
TMP_FILE=$(mktemp)

elastic-mapreduce \
  (... with all of my favorite options ...) \
| tee ${TMP_FILE}

# the job flow ID is the fourth field of the creation output
JOB_ID=$(awk '{print $4}' ${TMP_FILE})
rm ${TMP_FILE}

# poll until the cluster reaches the WAITING state
JOB_STATE=''
MASTER_HOSTNAME=''
while [ "${JOB_STATE}" != "WAITING" ]; do
  sleep 1
  echo -n .
  RESULT=$(elastic-mapreduce --list | grep ${JOB_ID})
  JOB_STATE=$(echo ${RESULT} | awk '{print $2}')
  MASTER_HOSTNAME=$(echo ${RESULT} | awk '{print $3}')
done
echo
echo Connecting to ${MASTER_HOSTNAME}...

# sticky growl notification to drag me back from whatever I wandered off to
growlnotify -n "EMR Interactive" -s -m "SSHing into ${MASTER_HOSTNAME}"

# log in as the hadoop user, tunneling the JobTracker web UI (port 9100)
ssh ${MASTER_HOSTNAME} -i ~/.ssh/emr-keypair -l hadoop -L 9100:localhost:9100


One of my personal productivity goals for the year is finding little places like this that I can optimize with a short script.  This particular one has rescued me from the clutches of HN more than once!

On Code Reviews and Developer Feedback

There's a great post from last week at 37signals, Give it five minutes:

While he was making his points on stage, I was taking an inventory of the things I didn’t agree with. And when presented with an opportunity to speak with him, I quickly pushed back at some of his ideas. I must have seemed like such an asshole.

His response changed my life. It was a simple thing. He said “Man, give it five minutes.” I asked him what he meant by that? He said, it’s fine to disagree, it’s fine to push back, it’s great to have strong opinions and beliefs, but give my ideas some time to set in before you’re sure you want to argue against them. “Five minutes” represented “think”, not react. He was totally right. I came into the discussion looking to prove something, not learn something.

There’s also a difference between asking questions and pushing back. Pushing back means you already think you know. Asking questions means you want to know. Ask more questions.

This is such a great outlook and a great way to approach the discussion of feedback for code reviews and design reviews.

It's surprising how little time development teams devote to training, or even internal discussion, on effective feedback. As developers, we are constantly engaged in this kind of communication: white-boarding sessions, spec reviews, design reviews, code reviews. We're expected to give and receive feedback on a daily basis, but few of us are properly prepared for it. Not only do we lack the training, but we have many negative examples to draw from. Who hasn't been part of a design review where tempers flared? Giving feedback well requires constant attention and practice. Receiving feedback can be just as difficult.

Culture of Communication

One of the major pillars of our engineering culture at Bizo is "the 3 Cs": Communication, Communication, Communication.

We've tried hard to build a team of engineers that are eager to receive feedback, humble about their abilities, objective and gracious with their feedback, and freely giving of their own knowledge and experience. We see communication as a prerequisite for building a world-class team and developing high-quality code. You often hear the phrase "strong opinions, weakly held," and that is the kind of culture we have tried to build.

Communication is hard. It takes real team agreement and a commitment to continued work to keep this culture alive and well. The team has to view effective communication as important, and the culture has to support it.

Code Reviews

Code reviews are something that can easily be approached from the wrong perspective, whether as an author or as a reviewer.

As a reviewer, it can be easy to jump in and argue, to try and push 'your' solution (even though it may be equivalent), to push back instead of asking questions and trying to understand.

As an author, it's far too easy to get attached to your code, to your specific solution/naming/etc. It's also easy to feel like each comment is an attack on your ability, and that accepting the feedback somehow means you were wrong or did a bad job. Of course, nothing could be further from the truth!

At Bizo, we perform code reviews for every change. They are a major part of our culture of communication. In order to perform effective code reviews, it's important to have some shared guidelines that help support effective communication.

Here are some guidelines we've found to be helpful for performing code reviews:

What is a code review?

  • A careful line-by-line critique of code by peers
  • Happens in a non-threatening context
  • Goal is cooperation and mutual learning, not fault finding

Code reviews are a team exercise to improve understanding and make the code better!

When people think of code reviews, they usually think of catching bugs. Code reviews do occasionally catch bugs or potential performance problems, but this is rare.

Just as important is fostering a shared understanding of the code and exposure to new approaches, techniques, and patterns. Seeing how your peers program is a great way to learn from them.

Enforcing coding standards and style guides is another way code reviews help. When you work on a team, a shared vocabulary keeps readability and quality high.

As an author

As an author, it's important to view each comment as a new opportunity to improve your code. Instead of jumping into defense mode, take a step back and think. Try to approach the code again for the first time with this new perspective. Your team has a lot of experience and varied backgrounds -- draw from them! They are there to help you. Use the gift of their experience and knowledge to improve the code.

Trust the team, and view all comments as action items. Some changes can seem arbitrary, especially when it comes to naming and organization. Unless there's a strong reason, tend to agree with your reviewers. If a reviewer finds something confusing, it is confusing! Code spends most of its life in maintenance and programming is a team sport. Remember that they are your audience, and you want them to be able to understand your code at 4am after a system crash.

As a reviewer

As a reviewer, it's important to take the time to understand the code, think, and ask questions before providing feedback. The author has probably spent a lot more time thinking about the problem and the approach over the course of the project than you have.

Be strict on coding standard and style guide violations. The real cost of software is maintenance (80% according to Sun). It's important the code is easily understood by the team.

Be gentle on personal preferences. If it's not a standard violation and just a matter of personal preference, defer to the author. It's okay to present your perspective, but mention that it's just a preference and not meant to be taken as an action item.

Trust the author. It's often the case that there are many valid approaches to a problem. It's great to present alternatives and discuss the pros and cons of each. If you see alternative solutions, bring them up! When discussing them, make sure to listen to the author. Remember that they are the subject matter expert, and you are working together on the same team.

It takes work!

Communication is hard! It's easy to screw up. It's easy to go into attack or defense mode when you're passionate about what you're doing. It's really something we all need to remind ourselves to work on every day, and something we need to revisit periodically as a team. Try to view each review as an opportunity to practice these guidelines. Just remember to take a step back, think, and ask questions.

Monday, March 12, 2012

Fault Tolerant MongoDB on EC2

While working on a project at Bizo, I needed to connect a Rails app to a MongoDB backend, both of which run in Amazon's cloud (EC2). At Bizo we have a policy of avoiding non-Amazon services when possible (to limit risk), so we normally run most of our services straight off of EC2. I'd like to share the best practices I learned along the way, in the hope that they save some time and frustration for others.

Primer

Replica sets are the preferred way to run a distributed, fault-tolerant MongoDB service. But as with any distributed system, nodes will eventually fail. Replica sets are pretty good at handling failures, but they can't save you if too many nodes fail: a replica set requires a minimum of two nodes to function at all times (1 primary and 1 secondary). A good rule of thumb, then, is to run at least 3 nodes in a replica set, so that a single node failure doesn't take your database service down with it. The Rails app I was working with doesn't experience enormous amounts of traffic, so 3 m1.large (64-bit) nodes were sufficient for my needs. What follows is a rundown of our setup and how it handles the common needs of fault-tolerant systems.
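For concreteness, here's roughly what initiating such a 3-node replica set looks like. This is a minimal sketch: the set name and hostnames are placeholders (ours point at the Elastic IP DNS names described below):

# minimal sketch: initiate a 3-node replica set from the mongo shell
# (set name and hostnames are placeholders)
mongo --host mongo-a.example.com --eval '
  rs.initiate({
    _id: "prod",
    members: [
      { _id: 0, host: "mongo-a.example.com:27017" },
      { _id: 1, host: "mongo-b.example.com:27017" },
      { _id: 2, host: "mongo-c.example.com:27017" }
    ]
  })
'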

Best Practices


Minimize Failure with Auto Scaling, Availability Zones, CloudWatch and EBS Volumes

  • Use Auto Scaling groups, CloudWatch and EBS volumes to replace failed nodes as soon as they go down. Since we run three nodes, our replica set is insulated from failure due to a single node crashing. But if two nodes crash, the replica set goes with them. To solve this, we use CloudWatch alarms to trigger the Auto Scaling group whenever a node goes down; that way a replacement node is automatically brought online within a few minutes of a failure, reducing the risk of nodes failing sequentially. Additionally, each node stores its data on an EBS volume (a network-attachable hard drive), so when a node fails, its replacement doesn't start up with missing data: it simply mounts the previous node's volume (see the startup sketch after this list).
  • To protect against multiple nodes failing simultaneously, run each node in a separate availability zone. The above isn't sufficient to protect against things like hardware failure, as all 3 instances could wind up on the same physical hardware. Running each node in a separate availability zone guarantees that our mongo instances run with a reasonable amount of separation (e.g. they don't all end up on the same hardware box). Ideally you'd run each node in its own region (a separate data center), but this causes headaches when configuring firewalls, as Amazon does not allow security groups to be used across multiple regions (see Security below). So unless you want to set up a VPN for cross-region communication, you're probably better off just running in separate availability zones.
    Assuming you've created the group and are running three nodes, each in a separate availability zone, you can configure the Auto Scaling group using Amazon's command line tools like so:
    
    as-update-auto-scaling-group my-mongo-service \
    --region us-east-1 \
    --availability-zones us-east-1a us-east-1b us-east-1c \
    --max-size 3 \
    --min-size 3 \
    --desired-capacity 3
    
    
    Now if any node fails, a new one will start up to take its place in the proper availability zone.
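Here's a rough sketch of the EBS-mounting portion of a replacement node's startup script. The volume IDs, device name, and mount point are all hypothetical, and it assumes the ec2-api-tools are installed with credentials configured:

#!/bin/bash
# sketch: on boot, attach and mount our zone's data volume
# (volume IDs, device name, and mount point are hypothetical)

ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# each zone has a well-known data volume
case ${ZONE} in
  us-east-1a) VOLUME_ID=vol-aaaaaaaa ;;
  us-east-1b) VOLUME_ID=vol-bbbbbbbb ;;
  us-east-1c) VOLUME_ID=vol-cccccccc ;;
esac

ec2-attach-volume ${VOLUME_ID} -i ${INSTANCE_ID} -d /dev/sdf
sleep 30  # give the attachment a moment to complete
mount /dev/sdf /var/lib/mongodb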

If Everything Fails, Have Backups

It's always good to have backups, just in case something really bad happens. Fortunately, since we use EBS volumes, this is really easy: we create nightly snapshots of the primary node's EBS volume from our cron server using Amazon's command line tools (ec2-create-snapshot). These snapshots are persisted to S3, and we can easily restore our replica set from them.
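The cron entry itself is essentially a one-liner; the volume ID below is a placeholder, and it assumes the EC2 tools and credentials are available to cron:

# crontab on the cron server: snapshot the primary's data volume nightly at 3am
# (vol-aaaaaaaa is a placeholder)
0 3 * * * ec2-create-snapshot vol-aaaaaaaa -d "nightly mongo primary backup"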

Use Elastic IPs

As nodes fail and are replaced, you want both your replica set and your database clients to be able to find and connect to the new nodes. The easiest way to do this in Amazon is to use Elastic IPs: special static IP addresses that can be assigned to individual instances. Since each instance runs in a separate availability zone, we just need one Elastic IP per zone. When a new node starts up to replace a failed instance, it checks which zone it was started in and assigns itself the matching Elastic IP. Both the client and the replica set configuration should point at the Elastic IP's public DNS name; that way failures and startups of new nodes will be seamless to your app. This matters because cross-security-group openings in the firewall need to use internal (not external) addresses, and from an EC2 instance the public DNS name resolves to an internal IP address (more on this under Security below).
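The self-assignment step looks something like this. As with the EBS sketch above, the zone-to-address mapping and the addresses themselves are placeholders:

#!/bin/bash
# sketch: on boot, claim the Elastic IP that belongs to our availability zone
# (addresses are placeholders)

ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

case ${ZONE} in
  us-east-1a) EIP=203.0.113.10 ;;
  us-east-1b) EIP=203.0.113.11 ;;
  us-east-1c) EIP=203.0.113.12 ;;
esac

ec2-associate-address ${EIP} -i ${INSTANCE_ID}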

Security

This is where the headaches can start. Ideally you want to restrict access to your MongoDB instances to just your client application, using Amazon's security groups. The way we normally set this up is to give the mongo instances a security group, say mongo-db-prod, and the client app a security group, say cool-app-prod. Then mongo-db-prod grants access on port 27017 (the default mongod port) to the cool-app-prod group. Unfortunately, what's not documented very well is that if you use the external Elastic IP addresses in your configuration, it will not work with security groups! Instead you have to use the Elastic IP's DNS url (found in Amazon's web console) for security groups to work properly.
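Granting that access from the command line looks something like this (the group names match the hypothetical ones above, and the 12-digit account ID is a placeholder):

# let instances in cool-app-prod reach mongod in mongo-db-prod on port 27017
# (111111111111 is a placeholder AWS account ID)
ec2-authorize mongo-db-prod -P tcp -p 27017 -o cool-app-prod -u 111111111111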

A Final Caveat

One thing to be careful of: if you require more than 5 nodes in a replica set, you'll run into a problem using Elastic IPs, as Amazon by default only allows 5 Elastic IPs per region. You'll need to either ask Amazon to increase this limit on your account or seek out an alternative setup.

Well, that's it. If you have another setup for running MongoDB on EC2, I'd love to hear it. Until next time.