bizo developer blog

SCM Migration

2013-08-26T13:44:00.000-07:00

We happily used Atlassian’s hosted OnDemand service for source code management with the following setup:

Subversion: source control management
FishEye: source code browsing
Crucible: code reviews
Jenkins (hosted on EC2): continuous integration and periodic jobs (http://dev.bizo.com/2009/11/using-hudson-to-manage-crons.html)

However, Atlassian is ending their OnDemand offering for source code management in October so it was time for a change. The good news: we were wanting to migrate to git anyway. The bad news: we had hundreds of projects in our subversion repository and needed to break them up into separate git repositories.

We switched on a Thursday morning with minimal developer interruption, and now we're on a new setup:

Bitbucket: source control management and code browsing
Crucible (hosted on EC2): code reviews
Jenkins (hosted on EC2): continuous integration and periodic jobs

How'd we do it? Read on, my friend.

Problem

Move hundreds of projects (some with differing branching structures) to an equivalent number of git repositories. And change hundreds of Jenkins job configurations from pulling code out of subversion to pulling code from git. And set up a new Crucible instance for code reviews for the hundreds of repos. All without disrupting the dev team's work. For subversion, this meant moving the code, including branches, and commit history from subversion into Bitbucket. For Jenkins, it meant changing the job configs to point at the equivalent git repository with the same code and branch as the old subversion configuration. This blog post focuses on the subversion to git migration. Fixing the Jenkins configs will be covered in a later blog post.

Subversion to Git

Converting a single repository from subversion to git is fortunately straightforward thanks to the terrific tool git-svn (https://www.kernel.org/pub/software/scm/git/docs/git-svn.html).

The challenging part was determining how each project configured branches. In subversion, branches are just another subdirectory in the repository, so any level of the directory hierarchy can hold branches. Git, however, only supports branches at the root of the repository. Git-svn allows you to tell git what directory the branches are in, but first you have to find what directory that is.

Our subversion repositories followed two primary branching structures: branch at the module level or branch at the project level.

I will call the first layout "module level". Module level projects had a separate branch point for each module in the project. These projects were usually several loosely connected modules that could be deployed separately or libraries that were related but could be imported independently. Module level projects looked like this:

- svn/<project>/trunk/<module1>

- svn/<project>/trunk/<module2>

- svn/<project>/branches/<module1>/<branch1_for_module1>

- svn/<project>/branches/<module1>/<branch2_for_module1>

- svn/<project>/branches/<module2>/<branch1_for_module2>

"module level" projects mapped into a separate git repo for each module using this git-svn command:

git svn clone <svn_root> --trunk <project>/trunk/<module> --branches <project>/branches/<module> --tags <project>/branches/<tags> <module>

The other branching structure I’ll call "project level". These projects also had multiple modules, but the branches were defined such that each branch contained the entire project. These projects were usually separate modules for the domain layer, application layer and web layer or closely related applications that use the same database. Parts could perhaps be deployed separately but they often need to be deployed at the same time such as when the database schema changed. Project level projects looked like this:

- svn/<project>/trunk/<module1>

- svn/<project>/trunk/<module2>

- svn/<project>/branches/<branch1>/<module1>

- svn/<project>/branches/<branch1>/<module2>

- svn/<project>/branches/<branch2>/<module1>

- svn/<project>/branches/<branch2>/<module2>

"project level" projects mapped into a single git repo containing all modules using a git-svn command:

git svn clone <svn_root> --trunk <project>/trunk --branches <project>/branches --tags <project>/branches <project>

To automate the git-svn clones, I wrote a ruby script that used "svn ls" to find the list of all projects. Each project was assumed to be "module level" unless it was in a hard-coded list of known "project level" projects. It was important for this to be fully automated as the list of "project level" projects was not complete until near the end of the migration. It took several tries to make sure the migration was correct. Some projects unfortunately used both branching structures, which is not supported by git-svn. Some of these branches were abandoned anyway, but others were moved using "svn mv" to fit that project's standard branch structure.
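The classification rule described above can be sketched in a few lines. This is a minimal Python illustration, not the Ruby script used for the migration: the function name, its parameters, and the example project names are hypothetical, and the --tags option is left out for brevity.

```python
def clone_commands(svn_root, project, modules, project_level):
    """Build the git-svn clone command(s) for one subversion project.

    project_level is the hard-coded set of projects known to branch at the
    project level; every other project is assumed to be "module level".
    """
    if project in project_level:
        # One git repository containing every module of the project.
        return [[
            "git", "svn", "clone", svn_root,
            "--trunk", f"{project}/trunk",
            "--branches", f"{project}/branches",
            project,
        ]]
    # One git repository per module.
    return [
        [
            "git", "svn", "clone", svn_root,
            "--trunk", f"{project}/trunk/{module}",
            "--branches", f"{project}/branches/{module}",
            module,
        ]
        for module in modules
    ]
```

Because the whole plan is derived from the project list and one set, adding a newly discovered "project level" project is a one-line change followed by a re-run.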

Local Git to Bitbucket

Atlassian provided a jar (https://go-dvcs.atlassian.com/display/aod/Migrating+from+Subversion+to+Git+on+Bitbucket) to push a git-svn repository up to Bitbucket. The jar can also create an authors file from the subversion repository to map a subversion user to the values git needs for a committer: first name, last name and email address. This made scripting the Bitbucket upload for each repository straightforward. The jar also handles syncs to an existing Bitbucket repository, so developers could continue committing to their svn projects and Bitbucket would automatically get updated. Note that it only does fast-forward syncs, so the incremental sync stops working once commits are made directly to Bitbucket.

Crucible

Crucible is a tool to facilitate code reviews. It imports commits from your SCM tool, allows inline comments on the diffs and manages the code review life cycle of assigning reviewers, tracking who has approved the changes, and closing the review once approved. Crucible setup is fairly straightforward with a couple of caveats.

Crucible needs to access your repositories to pull in the commit history. There is no native support for pointing Crucible at a Bitbucket team account and having Crucible automatically import each repository. There is a free add-on (https://marketplace.atlassian.com/plugins/com.atlassian.fecru.reposync.reposync) that works for an initial import, but at first it did not bring in new repositories that were added to the team account after the initial import. It turns out the update did not work because I was using a Bitbucket user that could not access the User list from the Bitbucket API. Changing the Bitbucket user to one with access to this API endpoint solved the problem. Incremental updates to the repository list are now working.

While Crucible supports ssh access to git repositories in general, I ran into the problem described here https://answers.atlassian.com/questions/34283/how-to-connect-to-bitbucket-from-fisheye. Basically, Crucible does not support Bitbucket's ssh URL format. Instead of using ssh, I had to use https to connect to the Bitbucket repositories. This means each repository configuration requires the Bitbucket username and password to be specified separately, which is not ideal.

Testing

After running git-svn clone on a few projects, I went ahead and pulled all the projects down with git-svn. The distributed nature of git helped testing because the entire repository could be represented locally without needing to upload it to any server to test the initial clones. However, cloning all the repositories took about 24 hours. During this time there was minimal CPU and I/O load so I multithreaded the cloning jobs using 16 threads. This improved the time to just 1.5 hours on only a dual core machine.
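Since the clones spent most of their time waiting rather than using CPU, a thread pool is a natural fit. The following is a minimal Python sketch of that idea under stated assumptions; the function names are hypothetical, and the original automation was a Ruby script whose details are not shown in this post.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_clone(command):
    """Run one clone command (a list of arguments) and return its exit code."""
    return subprocess.run(command).returncode


def clone_all(commands, worker=run_clone, threads=16):
    """Run every command on a pool of worker threads.

    Results come back in the same order as the input commands, so a
    non-zero exit code can be matched to the repository that failed.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(worker, commands))
```

The worker is a parameter so the pool can be exercised with a stand-in function before pointing it at real clone commands.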

I was initially hesitant to upload all the repositories to Bitbucket because I did not want to have to manually delete the repos if there was a problem. However, I found the Bitbucket REST API (https://confluence.atlassian.com/display/BITBUCKET/Use+the+Bitbucket+REST+APIs). It is pretty well put together and was easy to use because it generally follows REST conventions. I've yet to find anything that can be done in the UI that cannot be done in the API, which has been outstanding for adding niceties like commit hooks that push changes to Crucible for each repository. For the purposes of migration, the best feature was deleting repositories. Knowing I could automatically clean up any mistakes provided the confidence to just let it rip. I actually ended up using this to clean up two false start migrations:

git-svn has a "show-ignore" command to translate files ignored by subversion into a .gitignore file. I initially added .gitignore to the git repositories. However, this meant every repository had a commit in Bitbucket and so would no longer accept changes from subversion. This was resolved by adding .gitignore to subversion before the conversion.
The first authors file I created was missing a few users. This was discovered by noticing that the Bitbucket commit history did not look as nice. It was nice to be able to just wipe it all out with a single command, fix the authors file, and redo the upload with a single command.

Post-migration

The time following the initial migration was when the automation really came in handy. A couple developers were out of the office during the cut over. They were able to make commits of their local work to subversion and then I could re-sync just those repositories even after other developers had begun working on other repositories in Bitbucket. This went very smoothly with no hand wringing or diff patching required to make sure local work was not lost.

Wrap-up

Overall the migration went off with no hiccups. We're still tweaking our preferred settings for git pushes and pulls to get to our ideal workflow, but we're happy to be using Bitbucket. Crucible does not integrate with Bitbucket as nicely as it did with subversion in our old setup. Hopefully Atlassian will continue to make improvements to this integration as we really like the Crucible code review workflow. I'm always impressed how automation begets automation. Once you've taken the step of automating part of the process, it is so much easier to see the next step. We are already seeing some benefits from the time spent interacting with the Bitbucket API as we're now able to add and modify commit hooks on all the repositories easily.

Using AWS Custom SSL Domain Names for CloudFront

2013-06-20T11:41:00.001-07:00

AWS recently announced the limited availability of Custom SSL Domain Names for CloudFront. You have to request an invitation in order to start using it but I am guessing it won't be long until it has been rolled out to all customers.

We've been asking/waiting for Custom SSL on CloudFront for years and were excited when it finally came out. The sign up was easy and we were approved a day or two later.

Existing Setup

Our main use case for Custom SSL on CloudFront involves replacing a service that proxies secure requests to our non-secure CloudFront distro. We proxy secure requests because we didn't want the secure CloudFront domain leaking out to our customers for various reasons including:

We wanted to be able to point the domain elsewhere if we needed to
We wanted to keep our branding consistent on domains.

It basically looks like the following diagram:

The problem with having a proxy is twofold:

We have to operate that proxy, which goes against our general rule to "never operate services when AWS can do it for you"
We get subpar performance since requests are no longer served from a geographically distributed CDN.

But we needed the flexibility and branding mentioned above so we dealt with it. Not anymore...

Migrating to Custom SSL Domain Names for CloudFront

Once we got approval for custom SSL, the migration was pretty straightforward. I am not going to regurgitate the detailed documentation but will summarize the process.

Upload your SSL cert and make sure the path starts with "/cloudfront" (This was annoying because we couldn't reuse our existing certificates that we were already using for ELBs)
Update your CF distro (I did so via the AWS Console):

add the domain name you want to support (e.g. secure-example.bizographics.com from above)
choose the SSL cert that you uploaded in the first step
Save

Wait for the CF distro to redeploy the configuration change
Update your Route53 DNS to point at the CF CNAME rather than the ELB endpoint
Wait for DNS to Update
Shut down ELB of Proxy

As you can see this was pretty easy. Most of the time was spent waiting for the CF distro to redeploy (10s of minutes max) and DNS to update (which can take several days).

All-in-all, the minor annoyance of having two copies of the same SSL cert was worth the win of not having to operate the proxy and getting better performance for our customers. Check out the graph below showing the improved performance:

Note on Cost

The cost of custom SSL on CF seems ok but could be better and the wording is not totally clear: "You pay $600 per month for each custom SSL certificate associated with one or more CloudFront distributions." We have the same cert set up for multiple CF distros but I am not sure if we will be charged $600 for each distro using the cert or $600 for each cert regardless of how many distros are using it. (Will try to get clarification...) AWS claims the pricing is comparable to other similar offerings. That doesn't seem to jibe with their usual practice of driving costs much lower but is livable for now.

Scala Command-Line Hacks

2013-04-22T13:28:00.001-07:00

Do you like command-line scripting and one-liners with Perl, Ruby and the like?

For instance, here's a Ruby one-liner that uppercases the input:

% echo matz | ruby -p -e '$_.tr! "a-z", "A-Z"'

MATZ

You like that kind of stuff? Yes? Excellent! Then I offer you a hacking idea for Scala.

As you may know, Scala offers similar capability with the -e command-line option but it's fairly limited in its basic form because of the necessary boilerplate code to set up iteration over the standard input... it just begs for a simple HACK!

Using a simple bash wrapper,

#!/bin/bash
#
# Usage: scala-map MAP_CODE
#
code=$(cat <<END
scala.io.Source.stdin.getLines map { $@ } foreach println
END
)
scala -e "$code"

then we can express similar one-liners using Scala code and the standard library:

% ls | scala-map _.toUpperCase

FOO

BAR

BAZ

...

% echo "foo bar baz" | scala-map '_.split(" ").mkString("-")'

foo-bar-baz

Nifty, right? Here's another script template to fold over the standard input,

#!/bin/bash
#
# Usage: scala-fold INITIAL_VALUE FOLD_CODE
#
# where the following val's can be used in FOLD_CODE:
#
# `acc` is bound to the accumulator value
# `line` is bound to the current line
#
code=$(cat <<END
println(scala.io.Source.stdin.getLines.foldLeft($1) { case (acc, line) => $2 })
END
)
scala -e "$code"

Now if you wanted to calculate the sum of the second column of space-separated input, you'd write:

$ cat | scala-fold 0 'acc + (line.split(" ")(1).toInt)'

foo 1

bar 2

baz 3
(CTRL-D)

You get the idea ... hopefully this inspires you to try a few things with Scala scripting templates!

Disclaimer: I am not advocating these hacks as a replacement for learning other Unix power tools like grep, sed, awk, ... I am simply illustrating that Scala can be turned into an effective command-line scripting tool as part of the vast array of Unix tools. Use what works best for you.

Efficiency & Scalability

2013-04-19T11:54:00.003-07:00

Software engineers know that distributed systems are often hard to scale and many can intuitively point to reasons why this is the case by bringing up points of contention, bottlenecks and latency-inducing operations. Indeed, there exists a plethora of reasons and explanations as to why most distributed systems are inherently hard to scale, from the CAP theorem to scarcity of certain resources, e.g., RAM, network bandwidth, ...

It's said that good engineers know how to identify resources that may not appear to be relevant to scaling initially but will become more significant as particular kinds of demand grow. If that’s the case, then great engineers know that system architecture is often the determining factor in system scalability (that a system’s own architecture may be its worst enemy), so they define and structure systems in order to avoid fundamental flaws.

In this post, I want to explore the relationship between system efficiency and scalability in distributed systems; they are to some extent two sides of the same coin. We’ll consider specifically two common system architecture traits: replication and routing. Some of this may seem obvious to some of you but it’s always good to back intuition with some additional reasoning.

Before we go any further, it’s helpful to formulate a definition of efficiency applicable to our context:

efficiency is the extent to which useful work is performed relative to the total work and/or cost incurred.

We’ll also use the following definition of scalability,

scalability is the ability of a system to accommodate an increased workload by repeatedly applying a cost-effective strategy for extending a system’s capacity.

So, scalability and efficiency are both determined by cost-effectiveness with the distinction that scalability is a measure of marginal gain. Stated differently, if efficiency decreases significantly as a system grows, then a system is said to be non-scalable.

Enough rambling, let’s get our thinking caps on! Since we’re talking about distributed systems, it’s practically inevitable to compare against traditional single-computer systems, so we’ll start with a narrow definition of system efficiency:

Efficiency = (average work for processing a request on a single computer) / (average work for processing a request in a distributed system)

This definition is a useful starting point for our exploration because it abstracts out the nature of the processing that’s happening within the system; it’s overly simple but it allows us to focus our attention on the big picture.

More succinctly, we’ll write:

(1) Efficiency = Wsingle / Wcluster

Replication Cost

Many distributed systems replicate some or all of the data they process across different processing nodes (to increase reliability, availability or read performance) so we can model:

(2) Wcluster = Wsingle + (r x Wreplication)

where r is the number of replicas in the system and Wreplication is the work required to replicate the data to other nodes. Wreplication is typically lower than Wsingle, though realistically they have different cost models (e.g., Wsingle may be CPU-intensive whereas Wreplication may be I/O-intensive). If n is the number of nodes in the system, then r may be as large as (n-1), meaning replicating to all other nodes, though most systems will only replicate to 2 or 3 other nodes — for good reason — as we’ll discover later.

We’ll now define the replication coefficient, which expresses the relative cost of replication compared to the cost of processing the request on a single node:

(3) Qreplication = Wreplication / Wsingle

Solving for Wreplication, we get:

(4) Wreplication = Qreplication x Wsingle

If we substitute Wreplication in (2) by the equation formulated in (4), we obtain:

(5) Wcluster = Wsingle + ( r x Qreplication x Wsingle )

We now factor out Wsingle on the right side:

(6) Wcluster = Wsingle x [ 1 + ( r x Qreplication ) ]

Taking the efficiency equation (1) and substituting Wcluster from (6), the equation becomes:

(7) Efficiency = Wsingle / ( Wsingle x [ 1 + ( r x Qreplication ) ] )

We then cancel Wsingle to obtain the final efficiency for a replicating distributed system:

(8) Efficiency (replication) = 1 / [ 1 + (r x Qreplication) ]

As expected, both r and Qreplication are critical factors determining efficiency.

Interpreting this last equation and assuming Qreplication is a constant inherent to the system’s processing, our two takeaways are:

If the system replicates to all other nodes (i.e., r = n - 1) it becomes clear that the efficiency of the system will degrade as more nodes are added and will approach zero as n becomes sufficiently large.

To illustrate this, let's assume Qreplication is 10%,

Efficiency (r = 1, n = 2) = 91%

Efficiency (r = 2, n = 3) = 83%

Efficiency (r = 3, n = 4) = 77%

Efficiency (r = 4, n = 5) = 71%

Efficiency (r = 5, n = 6) = 67%

...

In other words, fully-replicated distributed systems don't scale.

For a system to scale, the replication factor should be a (small) constant.

Let's illustrate this with Qreplication fixed at 10% and using a replication factor of 3,

Efficiency (r = 3, n = 4) = 77%

Efficiency (r = 3, n = 5) = 77%

Efficiency (r = 3, n = 6) = 77%

Efficiency (r = 3, n = 7) = 77%

Efficiency (r = 3, n = 8) = 77%

...

As we can see, fixed-replication-factor distributed systems scale, although, as you might expect, they do not exhibit the same efficiency as a single-node system. At worst, when replicating costs as much as processing (Qreplication = 1), the efficiency will be 1 / (1 + r), roughly the 1/r you would intuitively expect.
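For readers who want to try other values, equation (8) is easy to evaluate directly. The snippet below is a small Python sketch; the function name is made up for this illustration, and it uses the same 10% replication coefficient assumed in the examples above.

```python
def replication_efficiency(r, q_replication):
    """Equation (8): efficiency when each request is replicated to r nodes."""
    return 1.0 / (1.0 + r * q_replication)


# Full replication: r grows with the cluster (r = n - 1).
full = [replication_efficiency(n - 1, 0.10) for n in range(2, 7)]

# Fixed replication factor: r stays at 3 no matter how large n gets.
fixed = [replication_efficiency(3, 0.10) for n in range(4, 9)]

print([f"{e:.0%}" for e in full])
print([f"{e:.0%}" for e in fixed])
```

The first list keeps shrinking as n grows, while the second never changes because n does not appear in the formula at all.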

Routing Cost

When a distributed system routes requests to nodes holding the relevant information (e.g., a partially replicated system, r < n) its working model may be defined as,

(9) Wcluster = (r / n) * Wsingle + (n-r)/n * (Wrouting + Wsingle)

The above equation represents the fact that r out of n requests are processed locally whereas the remainder of the requests are routed and processed on a different node.

Let’s define the routing coefficient to be,

(10) Qrouting = Wrouting / Wsingle

Solving (10) for Wrouting,

(11) Wrouting = Qrouting * Wsingle

and substituting into (9), we obtain,

(12) Wcluster = (r/n) * Wsingle + (n-r)/n * [ (Qrouting * Wsingle) + Wsingle ]

and taking the efficiency equation (1), substituting Wcluster from (12), the simplified equation becomes:

(13) Efficiency (routing) = n / [ n + (n - r) * Qrouting ]

Looking at this last equation, we can infer that:

As the system grows and n goes towards infinity, the efficiency of the system can be expressed as 1 / (1 + Qrouting). The efficiency is not dependent on the actual number of nodes within the system; therefore, routing-based systems generally scale. (But you knew that already.)

If the number of nodes is large compared to the replication factor (n >> r) and Qrouting is significant (1.0, same cost as Wsingle), then the efficiency is ½, or 50%. This matches the intuition that the system is routing practically all requests and therefore spending half of its efforts on routing. The system is scaling linearly but it’s costing twice as much to operate (for every node) compared to a single-node system.

If the cost of routing is insignificant (Qrouting = 0), the efficiency is 100%. That’s right, if it doesn’t cost anything to route the request to a node that can process it, the efficiency is the same as a single-node system.

Let’s consider a practical distributed system with 10 nodes (n = 10), a replication factor of 3 (r = 3), and a relative routing cost of 10% (Qrouting = 0.10). This system would have an efficiency of 10 / (10 + 7 * 10%) = 93.46%. As you can see, routing-based distributed systems can be pretty efficient if Qrouting is relatively small.
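Equation (13) can be checked the same way. This is a short Python sketch with a hypothetical function name; it evaluates the 10-node example and the two limiting cases discussed above.

```python
def routing_efficiency(n, r, q_routing):
    """Equation (13): efficiency when (n - r) out of n requests are routed."""
    return n / (n + (n - r) * q_routing)


# The practical example: 10 nodes, replication factor 3, 10% routing cost.
example = routing_efficiency(n=10, r=3, q_routing=0.10)
print(f"{example:.2%}")

# Very large cluster with routing as expensive as processing: approaches 1/2.
print(routing_efficiency(n=10**9, r=3, q_routing=1.0))

# Free routing: same efficiency as a single node.
print(routing_efficiency(n=10, r=3, q_routing=0.0))
```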

Where To Now?

Well, this was a fun exploration of system scalability in the abstract. We came up with interesting equations to describe the scalability of both data-replicating and request-routing architectures. With some tinkering, these can serve as a good basis for reasoning about some of your distributed systems.

In real life, however, there are many other aspects to consider when scaling systems. In fact, it often feels like a whack-a-mole hunt; you never know where the next performance non-linearity is going to rear its ugly head. But if you use either (or both) the data-replicating and request-routing style architecture with reasonable replication factors and you manage to keep your replication/routing costs well below your single-node processing costs, you may find some comfort in knowing that at least you haven’t introduced a fundamental scaling limitation into your system.

PS: With apologies for the formatting of the formulas ... Blogger wasn't exactly friendly with my equations imported from Google Docs so I had to go down the ASCII route. Thanks for reading and making it through!

Sensible Defaults for Apache HttpClient

2013-04-15T07:59:00.000-07:00

Defaults for HttpClient

Before coming to Bizo, I wrote a web service client that retrieved daily XML reports over HTTP using the Apache DefaultHttpClient. Everything went fine until one day the connection simply hung forever. We found this odd because we had set the connection timeout. It turned out we also needed to set the socket timeout (HttpConnectionParams.SO_TIMEOUT). The default for both connection timeout (max time to wait for a connection) and socket timeout (max time to wait between consecutive data packets) is infinity. The server was accepting the connection but then not sending any data so our client hung forever without even reporting any errors. Rookie mistake, but everyone is a rookie at least once. Even if you are an expert with HttpClient, chances are there will be someone maintaining your code in the future who is not.

Another problem with defaults using HttpClient is with PoolingClientConnectionManager. PoolingClientConnectionManager has two attributes: MaxTotal and MaxPerRoute. MaxTotal is the maximum total number of connections in the pool. MaxPerRoute is the maximum number of connections to a particular host. If the client attempts to make a request and either of these maximums has been reached, then by default the client will block until a connection is free. Unfortunately the default for MaxTotal is 20 and the default MaxPerRoute is only 2. In a SOA, it is common to have many connections from a client to a particular host. The limit of 2 (or even 1) connections per host makes sense for a polite web crawler, but in a SOA, you are likely going to need a lot more. Even the 20 maximum total connections in the pool is likely much lower than desired.

If the client does reach the MaxPerRoute or the MaxTotal connections, it will block until the connection manager timeout (ClientPNames.CONN_MANAGER_TIMEOUT) is reached. This timeout controls how long the client will wait for a connection from the connection manager. Fortunately, if this timeout is not set directly, it will default to the connection timeout if that is set, which will prevent the client from queuing up requests indefinitely.

What would a better set of defaults be?

A good default is something that is "safe". A safe default for a connection timeout is long enough to not give up waiting when things are working normally, but short enough to not cause system instability when the server is down. Unfortunately safe is context dependent. Safe for a daily data sync process and safe for an in-thread service request handler are very different. Safe for a request that is critical to the correct functioning of the program is different than safe for some ancillary logging that is ok to miss 1% of the time. A default for timeouts that is safe in all cases is not really possible.

Safe defaults for PoolingClientConnectionManager's MaxTotal and MaxPerRoute should be big enough that they won’t be hit unless there is a bug. New to version 4.2 is the fluent-hc API for making http requests. This uses a PoolingClientConnectionManager with defaults of 200 MaxTotal and 100 MaxPerRoute. We are using these same defaults for all our configurations.

Note that the fluent-hc API is very nice, but requires setting the connection timeouts on each request. This is perfect if you need to tune the settings for each request but does not provide a safety check against accidentally leaving the timeout infinite.

How can you help out a new dev implementing a new HTTP client?

If you can't have a safe default and the existing defaults are decidedly not safe, then it is best to require a configuration. We created a wrapper for PoolingClientConnectionManager that requires the developer to choose a configuration instead of letting the defaults silently take effect. One way to require a configuration is to force passing in the timeout values. However, it can be hard to know the right values, especially when stepping into a new environment. To help a developer implementing a new client at Bizo, we created some canonical configurations in the wrapper based on our experience working in our production environment on AWS. The configurations are:

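Configuration                | Connection timeout | Socket timeout | MaxTotal | MaxPerRoute
-----------------------------|--------------------|----------------|----------|------------
SameRegion                   | 125 ms             | 125 ms         | 200      | 100
SameRegionWithUSEastFailover | 1 second           | 1 second       | 200      | 100
CrossRegion                  | 10 seconds         | 10 seconds     | 200      | 100
MaxTimeout                   | 1 minute           | 5 minutes      | 200      | 100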

Clients with critical latency requirements can use the SameRegion configuration and need to make sure they are connecting to a service in the same AWS region. Back end processes that can tolerate latency can use the MaxTimeout configuration. Now when a developer is implementing a new client, the timeouts used by other services are readily available without having to hunt through other code bases. The developer can compare these with the current use case and choose an appropriate configuration. Additionally, if we learn that some of these configurations need to be tweaked, then we can easily modify all affected code.

Commonly the socket timeout will need to be adjusted for a specific service. After a connection is established, a service will not typically start sending its response until it has finished whatever calculation was requested. This can vary greatly even for different parameters on the same service endpoint. The socket timeout will need to be set based on the expected response times of the service.
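The "require a configuration" idea is independent of any one HTTP library. The sketch below shows its shape in Python rather than in the Java wrapper described here, so every class and function name is hypothetical; only the timeout and pool values come from the configurations listed above.

```python
from dataclasses import dataclass, replace
from enum import Enum


@dataclass(frozen=True)
class ClientConfig:
    connection_timeout_s: float
    socket_timeout_s: float
    max_total: int = 200
    max_per_route: int = 100


class CanonicalConfig(Enum):
    SAME_REGION = ClientConfig(0.125, 0.125)
    SAME_REGION_WITH_US_EAST_FAILOVER = ClientConfig(1.0, 1.0)
    CROSS_REGION = ClientConfig(10.0, 10.0)
    MAX_TIMEOUT = ClientConfig(60.0, 300.0)


def client_settings(config, socket_timeout_s=None):
    """Return the settings for a new client.

    There is deliberately no default for config: the caller must pick a
    canonical configuration. The socket timeout can be adjusted per service.
    """
    if not isinstance(config, CanonicalConfig):
        raise TypeError("choose one of the canonical configurations")
    settings = config.value
    if socket_timeout_s is not None:
        settings = replace(settings, socket_timeout_s=socket_timeout_s)
    return settings
```

A caller would write something like client_settings(CanonicalConfig.CROSS_REGION, socket_timeout_s=30.0), which makes the chosen trade-off visible at the call site instead of hiding it in library defaults.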

It is easy to miss a particular setting even if you know it is there. At Bizo, we are always looking for ways to solve a problem in one place. We are hopeful that this will eliminate any issues we have had with bad defaults in our HttpClients.

Map-side aggregations in Apache Hive

2013-02-18T14:07:00.000-08:00

When running large scale Hive reports, one error we occasionally run into is the following:

Possible error:
Out of memory due to hash maps used in map-side aggregation.

Solution:
Currently hive.map.aggr.hash.percentmemory is set to 0.5. Try setting it to a lower value. i.e 'set hive.map.aggr.hash.percentmemory = 0.25;'

What’s going on is that Hive is trying to optimize the query by performing a map-side aggregation. This is a map-side optimization that does a partial aggregation inside of the mapper, which results in the mapper outputting fewer rows. In turn, this reduces the amount of information that Hadoop needs to sort and distribute to the reducers.

Let’s think about what the Hadoop job looks like with the canonical word count example.

In the word count example, the naive approach is for the mapper to tokenize each row of input and output the key-value pair (#{token}, 1). The Hadoop framework will sort these pairs by the tokens, and the reducer sums the values to produce the total counts for each token.

Using a map-side aggregation, the mappers would instead tokenize each row and store partial counts in an in-memory hash map. (More precisely, the mappers are storing each key with the corresponding partial aggregation, which is just a count in this case.) Periodically, the mappers will output the pairs (#{token}, #{token_count}). The Hadoop framework again sorts these pairs and the reducers sum the values to produce the total counts for each token. In this case, the mappers will each output one row for each token every time the map is flushed instead of one row for each occurrence of each token. The tradeoff is that they need to keep a map of all tokens in memory.
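The difference between the two mappers can be seen in a toy version of word count. This is a plain Python sketch of the idea, not Hive's or Hadoop's implementation: the function names are made up, and the flush threshold is expressed as a number of hash map entries as a simple stand-in for a memory-based check.

```python
from collections import Counter


def naive_map(rows):
    """Emit (token, 1) for every occurrence of every token."""
    for row in rows:
        for token in row.split():
            yield token, 1


def aggregating_map(rows, max_entries):
    """Emit (token, partial_count) pairs, flushing the in-memory map
    whenever it holds more than max_entries distinct tokens."""
    partial = Counter()
    for row in rows:
        for token in row.split():
            partial[token] += 1
        if len(partial) > max_entries:
            yield from partial.items()
            partial.clear()
    # Flush whatever is left at the end of the input.
    yield from partial.items()


def reduce_counts(pairs):
    """Sum the values for each token, as the reducer would."""
    totals = Counter()
    for token, count in pairs:
        totals[token] += count
    return dict(totals)
```

Both mappers lead to the same totals after the reduce step; the aggregating one simply hands fewer pairs to the sort and shuffle, at the price of holding the map in memory.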

By default, Hive will try to use the map-side aggregation optimization, but it falls back to the standard approach if the hash map is not producing enough of a memory savings. After processing 100,000 rows (modifiable via hive.groupby.mapaggr.checkinterval), Hive will check the number of items in the hash map. If it exceeds 50% (modifiable via hive.map.aggr.hash.min.reduction) of the number of rows read, the map-side aggregation will be aborted.

Hive will also estimate the amount of memory needed for each entry in the hash map and flush the map to the reducers whenever the size of the map exceeds 50% of the available mapper memory (modifiable via hive.map.aggr.hash.percentmemory). This, however, is an estimate based on the number of rows and the expected size of each row, so if the memory usage per row is unexpectedly high, the mappers may run out of memory before the hash map is flushed to the reducers.

In particular, if a query uses a count distinct aggregation, the partial aggregations actually contain a list of all values seen. As more distinct values are seen, the amount of memory used by the map will increase without necessarily increasing the number of rows of the map, which is what Hive uses to determine when to flush the partial aggregations to the reducers.
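A small sketch of why count distinct behaves differently, using plain Scala collections rather than Hive internals. DistinctPartials and its methods are hypothetical names; rows are modeled as (group key, value) pairs.

```scala
object DistinctPartials {
  // Partial aggregate for count(v): a single number per group key.
  def partialCounts(rows: Seq[(String, String)]): Map[String, Long] =
    rows.groupBy(_._1).map { case (key, rs) => key -> rs.size.toLong }

  // Partial aggregate for count(distinct v): the set of values seen per key.
  def partialDistincts(rows: Seq[(String, String)]): Map[String, Set[String]] =
    rows.groupBy(_._1).map { case (key, rs) => key -> rs.map(_._2).toSet }
}
```

With one group key and a thousand distinct values, both maps have exactly one entry, but the second one is holding a thousand strings. Counting entries says nothing about how much memory that one entry uses.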

Whenever a mapper runs out of memory, a group by clause is present, and map-side aggregation is turned on, Hive will helpfully suggest that you reduce the flush threshold to avoid running out of memory. This will lower the threshold (in rows) of when Hive will automatically flush the map, but it may not help if the map size (in bytes) is growing independently of the number of rows.

Some alternate solutions include simply turning off map-side aggregations (set hive.map.aggr = false), allocating more memory to your mappers via the Hadoop configuration, or restructuring the query so that Hive will pick a different query plan.

For example, a simple
select count(distinct v) from tbl
can be rewritten as
select count(1) from (select v from tbl group by v) t.

This latter query will avoid using the count distinct aggregation and may be more efficient for some queries.

Reader Driven Development

2013-02-15T09:10:00.001-08:00

In this talk on Effective ML, Yaron Minsky talks about Reader Driven Development. That is, writing your code with the reader in mind. Making decisions that will make the code more easily read and understood by other developers down the line.

The interest of the reader always pushes in the direction of clarity, simplicity, and the ability to change the code later. In most real projects, code is read and changed many more times than it is written. The reader's interests are paramount in that regard.

When writing code the interests of the reader and writer may be at odds, and when faced with a decision, always err in the direction of the reader. The reader is always right. Regardless of team size, it's helpful to program this way. Even code you've written yourself may not be as clear 6 months or a year later otherwise. Great perspective, and I think it fits in nicely with previous posts here on programming style and code reviews (tend to agree with your reviewers, they are the audience!).

Asanban: Lean Development with Asana and Kanban

2013-01-23T15:03:00.000-08:00

On Bizo's External Apps team (a.k.a. 'xapps'), we've been using a Kanban system to manage our work. All of Bizo Engineering uses Asana to track tasks, which isn't specifically designed for Kanban. We've settled on a set of conventions that we use in Asana which enable our Kanban system. These conventions also help us to track metrics like the average lead time from month to month.

Background

Kanban is a second-generation Agile software development methodology. The focus is on finding and fixing bottlenecks, as well as removing waste by limiting work-in-progress. (The "WIP" limits referenced in this post are the number of work items that are allowed to be in a particular stage of the system at one time.) Adopting a Kanban system has made things easier for engineers, increased efficiency, and is very popular with our Product Management folks as well. We are now focused on delivering value incrementally rather than specifying and implementing larger chunks of work. If you're interested in adopting Kanban, I recommend reading David Anderson's seminal book on the topic: Kanban: Successful Evolutionary Change for Your Technology Business

Conventions

Each stage of work in our value chain is a priority heading in Asana. The name of the priority header follows the convention: "{STEP NAME} ({WIP}):", e.g. "Dev Ready (10):". The steps that are earliest in the value chain are at the bottom of our Asana project, with tasks moving upwards through each stage until they reach "Production (15):" at the top when the functionality described by the task has been delivered to production. Once Product Management has verified that the functionality described by a task is functioning correctly in production, they mark the task as complete. We use tags to represent work item types, although it's fairly limited at present.
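As an illustration of the header convention, here is a minimal Scala sketch that pulls the step name and WIP limit out of a priority header. This is not code from the asanban gem (which is Ruby); PriorityHeader, Stage, and parse are hypothetical names.

```scala
object PriorityHeader {
  final case class Stage(name: String, wipLimit: Int)

  // Matches "{STEP NAME} ({WIP}):", e.g. "Dev Ready (10):".
  // The WIP limit is capped at 9 digits so it always fits in an Int.
  private val Header = """^(.+?)\s*\((\d{1,9})\):$""".r

  def parse(header: String): Option[Stage] = header.trim match {
    case Header(name, wip) => Some(Stage(name, wip.toInt))
    case _                 => None // an ordinary task, not a stage header
  }
}
```

Anything that does not follow the convention comes back as None, so regular task names are simply skipped.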

Metrics

One of the most basic metrics to track in a Kanban system is the average amount of lead time (the time it takes from when a task gets added to the input queue until value is delivered). I have created some tooling that allows us to accomplish this systematically. I'll first describe what it does, and then how you can use it.

The first piece of the tooling bulk loads task data from the Asana API into MongoDB. The API returns JSON and I just store JSON as-is in MongoDB, which works out well since MongoDB speaks JSON natively. One hiccup is that once tasks are archived in Asana, you can no longer obtain information about them through the API. Accordingly, the bulk load needs to be scheduled to run on a regular basis (in our case, every night) so that we don't lose information about archived tasks. Furthermore, we have a policy that tasks should not be archived until they have been completed for at least 24 hours, so that the bulk loader will always run at least once after a task has been completed before it gets archived. After loading the task data, the bulk loader will create data describing how much time each task spent in each state, as well as how long each task took (in days) to complete from start to finish (lead time).

The other piece is a Sinatra web service that runs a map-reduce against the lead time data created by the bulk loader and serves lead times by month as JSON. It can also aggregate by year or day (but I don't think aggregating by day is useful).
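For readers who want to see the shape of the calculation, here is a rough in-memory Scala sketch of averaging lead times by month. It is not how the gem does it (the gem runs a map-reduce in MongoDB); the CompletedTask type, its field names, and the choice to group by completion month are all assumptions for illustration.

```scala
import java.time.{LocalDate, YearMonth}
import java.time.temporal.ChronoUnit

object LeadTimes {
  // Hypothetical task record: when it entered the input queue and when
  // it was delivered.
  final case class CompletedTask(name: String, started: LocalDate, completed: LocalDate)

  // Lead time in whole days from start to finish.
  def leadTimeDays(task: CompletedTask): Long =
    ChronoUnit.DAYS.between(task.started, task.completed)

  // Average lead time, grouped by the month in which the task completed.
  def averageByMonth(tasks: Seq[CompletedTask]): Map[YearMonth, Double] =
    tasks.groupBy(t => YearMonth.from(t.completed)).map { case (month, ts) =>
      month -> ts.map(leadTimeDays).sum.toDouble / ts.size
    }
}
```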

I have packaged up both of those pieces into a gem called "asanban", which you can use. The source code and instructions for installation and usage are here: https://github.com/patgannon/asanban

Pain Points

There are a couple of problems I've run into using Asana with a Kanban system. The first is that there's no way to enforce WIP limits. Users just have to be mindful of the limits shown in the priority headers. I have been thinking about writing a nightly report that uses the data created by the bulk loader to find violated WIP limits and send out emails, but I haven't gotten to it yet. (This tooling is essentially a hack day project at this point.) There is also no functionality to facilitate different classes of service (SLAs and WIP break-downs), but maybe those could be supported using the same kind of nightly report.
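The core of such a report would be small. Here is a hedged Scala sketch of the check itself, assuming the stage limits and the per-stage task counts have already been loaded from somewhere; WipCheck, Stage, and violations are hypothetical names, and nothing like this exists in the tooling yet.

```scala
object WipCheck {
  final case class Stage(name: String, wipLimit: Int)

  // Returns each stage whose current task count exceeds its WIP limit,
  // paired with that count. Stages with no tasks count as zero.
  def violations(stages: Seq[Stage], tasksPerStage: Map[String, Int]): Seq[(Stage, Int)] =
    stages
      .map(stage => stage -> tasksPerStage.getOrElse(stage.name, 0))
      .filter { case (stage, count) => count > stage.wipLimit }
}
```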

Another problem I've run into is that task sizes can be all over the place, which reduces the meaningfulness of the metrics. Some Kanban practitioners use hierarchical work items to address this kind of variability in size. Stories can be grouped into epics and/or broken down into "grains". Asana does support sub-tasks, so I may recommend that we use those to break down large work items in the future, at which point the bulk loader would be modified to track metrics by sub-task (for tasks that have them, which would be assumed to be epics).

Next Steps

As we fine tune our Kanban process, we'll use these lead time metrics to verify that our performance improves when we make an adjustment (changing a WIP limit or adding a buffer, for example). I'd like to have more metrics so that we can have even better insight into our system in the future. For example, I'd like to see the average time tasks spend in particular steps, the average amount of total WIP, as well as WIP in each step (shown over time) and failure load.

The first order of business moving forward on this will probably be an improved charting interface on top of the existing metrics. Also, it would be nice if the bulk loader used a scheduling library so that folks don't have to manually schedule it in cron. It also could use some automated tests!!! As I mentioned previously, I've just been working on this on hack days, so if there's something you'd like to see done soon, well... pull requests will be gladly accepted! :)

What Makes Spark Exciting

2013-01-21T12:03:00.001-08:00

At Bizo, we’re currently evaluating/prototyping Spark as a replacement for Hive for our batch reports.

As a brief intro, Spark is an alternative to Hadoop. It provides a cluster computing framework for running distributed jobs. Similar to Hadoop, you provide Spark with jobs to run, and it handles splitting up the job into small tasks, assigning those tasks to machines (optionally with Hadoop-style data locality), issuing retries if tasks fail transiently, etc.

In our case, these jobs are processing a non-trivial amount of data (log files) on a regular basis, for which we currently use Hive.

Why Replace Hive?

Admittedly, Hive has served us well for quite a while now. (One of our engineers even built a custom “Hadoop on demand” framework for running periodic on-demand Hadoop/Hive jobs in EC2 several months before Amazon Elastic Map Reduce came out.)

Without Hive, it would have been hard for us to provide the same functionality, probably at all, let alone in the same time frame.

That said, it has gotten to the point where Hive is more frequently invoked in negative contexts (“damn it, Hive”) than positive.

Personally, I admittedly even try to avoid tasks that involve working with Hive. I find it to be frustrating and, well, just not a lot of fun. Why? Two primary reasons:

1. Hive jobs are hard to test

Bizo has a culture of excellence, and for engineering one of the things this means is testing. We really like tests. Especially unit tests, which are quick to run and enable a fast TDD cycle.

Unfortunately, Hive makes unit testing basically impossible. For several reasons:

Hive scripts must be run in a local Hadoop/Hive installation.

Ironically, very few developers at Bizo have local Hadoop installations. We are admittedly spoiled by Elastic Map Reduce, such that most of us (myself anyway) wouldn’t even know how to set up Hadoop off the top of our heads. We just fire up an EMR cluster.
Hive scripts have production locations embedded in them.

Both our log files and report output are stored in S3, so our Hive scripts end up with lots of “s3://” paths scattered throughout them.

While we do run dev versions of reports with “-dev” S3 buckets, still relying on S3 and raw log files (that are usually in a compressed/binary-ish format) is not conducive to setting up lots of really small, simplified scenarios to unit test each boundary case.
Hive scripts do not provide any abstraction–they are just one big HiveQL file. This means it’s hard to break up a large report into small, individually testable steps.

Despite these limitations, about a year ago we had a developer dedicate some effort to prototyping an approach that would run Hive scripts within our CI workflow. In the end, while his prototype worked, the workflow was wonky enough that we never adopted it for production projects.

The result? Our Hive reports are basically untested. This sucks.

2. Hive is hard to extend

Extending Hive via custom functions (UDFs and UDAFs) is possible, and we do it all the time–but it’s a pain in the ass.

Perhaps this is not Hive’s fault, and it’s some Hadoop internals leaking into Hive, but the various ObjectInspector hoops, to me, always seemed annoying to deal with.

Given these shortcomings, Bizo has been looking for a Hive-successor for a while, even going so far as to prototype revolute, a Scala DSL on top of Cascading, but had not yet found something we were really excited about.

Enter Spark!

We had heard about Spark, but did not start trying it until the Spark presentation at AWS re:Invent (the talk received the highest rating of all non-keynote sessions) impressed us so much that we wanted to learn more.

One of Spark’s touted strengths is being able to load and keep data in memory, so your queries aren’t always I/O bound.

That is great, but the exciting aspect for us at Bizo is how Spark, either intentionally or serendipitously, addresses both of Hive’s primary shortcomings, and turns them into huge strengths. Specifically:

1. Spark jobs are amazingly easy to test

Writing a test in Spark is as easy as:

class SparkTest {
  @Test
  def test() {
    // this is real code...
    val sc = new SparkContext("local", "MyUnitTest")
    // and now some pseudo code...
    val output = runYourCodeThatUsesSpark(sc)
    assertAgainst(output)
  }
}

(I will go into more detail about runYourCodeThatUsesSpark in a future post.)

This one-liner starts up a new SparkContext, which is all your program needs to execute Spark jobs. There is no local installation required (just have the Spark jar on your classpath, e.g. via Maven or Ivy), no local server to start/stop. It just works.

As a technical aside, this “local” mode starts up an in-process Spark instance, backed by a thread-pool, and actually opens up a few ports and temp directories, because it’s a real, live Spark instance.

Granted, this is usually more work than you want done in a unit test (which ideally would not hit any file or network I/O), but the redeeming quality is that it’s fast. Tests run in ~2 seconds.

Okay, yes, this is slow compared to pure, traditional unit tests, but is such a huge revolution compared to Hive that we’ll gladly take it.

2. Spark is easy to extend

Spark’s primary API is a Scala DSL, oriented around what they call an RDD, or Resilient Distributed Dataset, which is basically a collection that only supports bulk/aggregate transforms (so methods like map, filter, and groupBy, which can be seen as transforming the entire collection, but no methods like get or take which assume in-memory/random access).

Some really short, made up example code is:

// RDD[String] is like a collection of lines
val in: RDD[String] = sc.textFile("s3://bucket/path/")
// perform some operation on each line
val suffixed = in.map { line => line + "some suffix" }
// now save the new lines back out
suffixed.saveAsTextFile("s3://bucket/path2")

Spark’s job is to package up your map closure, and run it against that extra large text file across your cluster. And it does so by, after shuffling the code and data around, actually calling your closure (i.e. there is no LINQ-like introspection of the closure’s AST).

This may seem minor, but it’s huge, because it means there is no framework code or APIs standing between your running closure and any custom functions you’d want to run. Let’s say you want to use SomeUtilityClass (or the venerable StringUtils), just do:

import com.company.SomeUtilityClass
val in: RDD[String] = sc.textFile("s3://bucket/path/")
val processed = in.map { line =>
  // just call it, it's a normal method call
  SomeUtilityClass.process(line) 
}
processed.saveAsTextFile("s3://bucket/path2")

Notice how SomeUtilityClass doesn’t have to know it’s running within a Spark RDD in the cluster. It just takes a String. Done.

Similarly, Spark doesn’t need to know anything about the code you use within the closure; it just needs to be available on the classpath of each machine in the cluster (which is easy to do as part of your cluster/job setup, you just copy some jars around).

This seamless hop between the RDD and custom Java/Scala code is very nice, and means your Spark jobs end up reading just like regular, normal Scala code (which to us is a good thing!).
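A side effect worth spelling out: because the closure body is ordinary code, the per-line logic can be unit tested with no Spark at all. The sketch below uses a hypothetical LineProcessing.process as a stand-in for SomeUtilityClass.process and runs it over a plain Scala Seq; the same function could be handed to an RDD's map unchanged.

```scala
object LineProcessing {
  // Hypothetical per-line logic: turn a comma-separated line into a
  // tab-separated one, trimming whitespace around each field.
  def process(line: String): String =
    line.split(",").map(_.trim).mkString("\t")
}
```

In a test, Seq("a, b ,c").map(LineProcessing.process) exercises exactly the code that would run inside the cluster, which keeps the Spark-specific part of a job thin.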

Is Spark Perfect?

As full disclosure, we’re still in the early stages of testing Spark, so we can’t yet say whether Spark will be a wholesale replacement for Hive within Bizo. We haven’t gotten to any serious performance comparisons or written large, complex reports to see if Spark can take whatever we throw at it.

Personally, I am also admittedly somewhat infatuated with Spark at this point, so that could be clouding my judgement about the pros/cons and the tradeoffs with Hive.

One Spark con so far is that Spark is pre-1.0, and it can show. I’ve seen some stack traces that shouldn’t happen, and some usability warts, that hopefully will be cleared up by 1.0. (That said, even as a newbie I find the codebase small and very easy to read, such that I’ve had several small pull requests accepted already–which is a nice consolation compared to the daunting codebases of Hadoop and Hive.)

We have also seen that, for our first Spark job, moving from “Spark job written” to “Spark job running in production” is taking longer than expected. But given that Spark is a new tool to us, we expect this to be a one-time cost.

More to Come

I have a few more posts coming up which explain our approach to Spark in more detail, for example:

Testing best practices
Running Spark in EMR
Accessing partitioned S3 logs

To see those when they come out, make sure to subscribe to the blog, or, better yet, come work at Bizo and help us out!

Grouping pageviews into visits: a Scala code kata

2012-09-26T06:00:00.000-07:00

The basic units of any website traffic analysis are pageviews, visits, and unique visitors. Tracking pageviews is simply a matter of counting requests to the server. Calculating unique visitors usually relies on cookies and unique identifiers. Visits, however, require a bit more work. For our purposes, a single visit is defined as a sequence of pageviews where the interval between pageviews is less than a fixed length like 15 minutes.

I thought that the problem of grouping pageviews into visits would make an interesting code kata. Here’s the statement of the problem that I worked from:

Given a non-empty sequence of timestamps (as milliseconds since the epoch), write a function that would return a sequence of visits, where each visit is itself a sequence of timestamps where each pair of consecutive timestamps is no more than N milliseconds apart.

As a starting point, I decided to take a straightforward procedural approach:

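import scala.collection.mutable.ListBuffer

// Maximum gap between pageviews within one visit (example value: 15 minutes).
val N: Long = 15 * 60 * 1000L

def doingItIteratively(pageviews: Seq[Long]): Seq[Seq[Long]] = {
  val iterator = pageviews.sorted.iterator
  val visits = ListBuffer[ListBuffer[Long]]()

  // Safe because the problem statement guarantees a non-empty input.
  var previousPV: Long = iterator.next()
  var currentVisit: ListBuffer[Long] = ListBuffer(previousPV)

  for (currentPV <- iterator) {
    if (currentPV - previousPV > N) {
      // Gap too large: close the current visit and start a new one.
      visits += currentVisit
      currentVisit = ListBuffer[Long]()
    }

    currentVisit += currentPV
    previousPV = currentPV
  }
  visits += currentVisit

  visits.map(_.toList).toList
}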

So, we simply iterate through the (sorted) events tracking both the current visit and the previous pageview. If the current pageview represents a new visit, push the previous visit into the list of all visits and start a new one. Then push the current pageview into the (potentially new) visit.

It actually felt a bit odd to write procedural code like this and ignore the functional parts of Scala. Using a fold cleans the code up a bit and gets rid of the mutable state.

def doingItByFolds(pageviews: Seq[Long]): Seq[Seq[Long]] = {
val sortedPVs = pageviews.sorted

(Seq[Seq[Long]]() /: sortedPVs) { (visits, pv) =>
   val isNewVisit = visits.lastOption flatMap (_.lastOption) map {
     prevPV => pv - prevPV > N
   } getOrElse true

   if (isNewVisit) {
     visits :+ Seq(pv)
   } else {
     visits.init :+ (visits.last :+ pv)
   }
}
}

Here, we’re starting with an empty list of visits and folding it over the sorted pageviews. At each pageview, we decide if we need to start a new visit. If so, we append a new visit containing the pageview to the accumulated visits. If not, we pop off the last visit, append the pageview, and put the last visit back on the tail of the accumulated visits.

One part that’s still a bit messy is comparing the current timestamp to the previous one. We can improve that by iterating through the intervals between pageviews instead of the actual pageviews.

def slidingThroughIt(pageviews: Seq[Long]): Seq[Seq[Long]] = {
val intervals = (0L +: pageviews.sorted).sliding(2)

(Seq[Seq[Long]]() /: intervals) {
   (visits, interval) =>
     if (interval(1) - interval(0) > N) {
       visits :+ Seq(interval(1))
     } else {
       visits.init :+ (visits.last :+ interval(1))
     }
}
}

Here, we’re prepending a “0L” timestamp (and assuming that none of the pageviews happened in the early 70s) and using the “sliding” method to pair each timestamp with the previous one.

So far, we’ve been using a sequence of pageviews as a visit. What happens if we add an explicit Visit type? This lets us convert all pageviews into Visits at the start, then focus on merging overlapping Visits. One nice benefit is that this is a map-reduce algorithm that can be easily parallelized instead of one that must sequentially iterate over the pageviews (either explicitly or with a fold).

import scala.math.{max, min}

case class Visit(start: Long, end: Long, pageviews: Seq[Long]) {
  // Merge two visits into one that spans both.
  def +(other: Visit): Visit = {
    Visit(min(start, other.start), max(end, other.end),
          (pageviews ++ other.pageviews).sorted)
  }
}

def doingItMapReduceStyle(pageviews: Seq[Long]): Seq[Visit] = {
  // Note: on Scala 2.13+, .par requires the scala-parallel-collections module.
  pageviews.par map { pv =>
    Seq(Visit(pv, pv + N, Seq(pv)))
  } reduce { (visits1, visits2) =>
    val sortedVisits = (visits1 ++ visits2) sortBy (_.start)

    (Seq[Visit]() /: sortedVisits) { (visits, next) =>
      if (visits.lastOption map (_.end >= next.start) getOrElse false) {
        visits.init :+ (visits.last + next)
      } else {
        visits :+ next
      }
    }
  }
}

The map-reduce solution is fun, but in a production system, I’d probably stick with the sliding variation and add a bit more flexibility to track actual pageview objects instead of just timestamps.
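Here is one way that more flexible version might look. It is a sketch rather than production code: it uses a fold over the sorted events instead of sliding, takes the maximum gap as a parameter instead of a global N, and the Pageview type, groupIntoVisits, and maxGap are names I made up for illustration.

```scala
object VisitGrouping {
  // Hypothetical pageview record; any type with a timestamp would do.
  final case class Pageview(url: String, timestamp: Long)

  // Groups events into visits, where consecutive events in a visit are
  // no more than maxGap milliseconds apart. The timestamp function says
  // how to read the time (ms since the epoch) out of an event.
  def groupIntoVisits[A](events: Seq[A], maxGap: Long)(timestamp: A => Long): Seq[Seq[A]] = {
    val sorted = events.sortBy(timestamp)
    sorted.foldLeft(Vector.empty[Vector[A]]) { (visits, event) =>
      visits.lastOption match {
        // Close enough to the previous event: extend the current visit.
        case Some(current) if timestamp(event) - timestamp(current.last) <= maxGap =>
          visits.init :+ (current :+ event)
        // First event, or the gap is too large: start a new visit.
        case _ =>
          visits :+ Vector(event)
      }
    }
  }
}
```

Unlike the versions above, this one also handles an empty input (it returns no visits) and does not need the prepended 0L timestamp.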

Using GROUP BYs or multiple INSERTs with complex data types in Hive.

2012-09-19T06:00:00.000-07:00

In any sort of ad hoc data analysis, the first step is often to extract a specific subset of log lines from our files. For example, when looking at a single partner’s web traffic, I often use an initial query to copy that partner’s data into a new table. In addition to segregating out only the data relevant to my analysis, I use this to copy the data from S3 into HDFS, which will make later queries more efficient. (Using maps as our log lines is how we support dynamic columns.)

create external table if not exists
original_logs(fields map<string,string>) location "..." ;

create table if not exists
extracted_logs(fields map<string,string>) ;

insert overwrite table extracted_logs
select * from original_logs where fields["partnerId"] = 123 ;

If I’m doing this for multiple partners, it’s tempting to use a multiple-insert so Hadoop only needs to make one pass of the original data.

create external table if not exists
original_logs(fields map<string,string>) location "..." ;

create table if not exists
extracted_logs(fields map<string,string>)
partitioned by (partnerId int);

from original_logs
insert overwrite table extracted_logs partition (partnerId = 123)
select * where fields["partnerId"] = 123
insert overwrite table extracted_logs partition (partnerId = 234)
select * where fields["partnerId"] = 234 ;

Unfortunately, in Hive 0.7.x, this query fails with the error message “Hash code on complex types not supported yet.” A multiple-insert statement uses an implicit group by, and Hive 0.7.x does not support grouping by complex types. This bug was partially addressed in 0.8, which added support for arrays and maps, but structs and unions are still not supported.

At an initial glance, it does look like adding this support should be straightforward. This could be a good candidate for our next open source day.

mdadm: device or resource busy

2012-07-07T12:44:00.001-07:00

I just spent a few hours tracking an issue with mdadm (Linux utility used to manage software RAID devices) and figured I'd write a quick blog post to share the solution so others don't have to waste time on the same.

As a short background, we use mdadm to create RAID-0 striped devices for our Sugarcube analytics (OLAP) servers using Amazon EBS volumes.

The issue manifested itself as a random failure during device creation:

$ mdadm --create /dev/md0 --level=0 --chunk 256 --raid-devices=4 /dev/xvdh1 /dev/xvdh2 /dev/xvdh3 /dev/xvdh4
mdadm: Defaulting to version 1.2 metadata
mdadm: ADD_NEW_DISK for /dev/xvdh3 failed: Device or resource busy

I searched and searched the interwebs and tried every trick I found to no avail. We don't have dmraid installed on our Linux images (Ubuntu 12.04 LTS / Alestic cloud image) so there's no possible conflict there. All devices were clean, as they are freshly created EBS volumes and I knew none of them were in use.

Before running mdadm --create, mdstat was clean:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
unused devices: <none>

And yet after running it the devices were assigned to two different devices instead of just /dev/md0:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : inactive xvdh4[3](S) xvdh3[2](S)
1048573952 blocks super 1.2
md0 : inactive xvdh2[1](S) xvdh1[0](S)
1048573952 blocks super 1.2
unused devices: <none>

Looking into dmesg didn't reveal anything interesting either:

$ dmesg
...
[3963010.552493] md: bind<xvdh1>
[3963010.553011] md: bind<xvdh2>
[3963010.553040] md: could not open unknown-block(202,115).
[3963010.553052] md: md_import_device returned -16
[3963010.566543] md: bind<xvdh3>
[3963010.731009] md: bind<xvdh4>

And strangely, the creation or assembly would sometimes work and sometimes not:

$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0
$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: /dev/md0 has been started with 4 drives.
$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0
$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: cannot open device /dev/xvdh3: Device or resource busy
$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0
$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: cannot open device /dev/xvdh1: Device or resource busy
mdadm: /dev/xvdh1 has no superblock - assembly aborted
$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0
$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: /dev/md0 has been started with 4 drives.

I started suspecting I was facing some kind of underlying race condition where the devices would get assigned/locked during the device creation process. So I started googling for "mdadm create race" and I finally found a post that tipped me off. While it didn't provide the solution, the post put me on the right track by mentioning udev, and it took only a few more minutes to narrow in on the solution: disabling udev events during device creation to avoid contention on device handles.

So now our script goes something like:

$ udevadm control --stop-exec-queue
$ mdadm --create /dev/md0 --run --level=0 --raid-devices=4 ...
$ udevadm control --start-exec-queue

And we now have consistent reliable device creation.

Hopefully this blog post will help other passers-by with a similar problem. Good luck!

Amazon Web Services Outages: 4 Steps for Survival

2012-07-03T08:45:00.002-07:00

(Cross-post from the Bizo Blog)

Another Amazon Web Services (AWS) cloud outage over the past weekend took down some pretty major services such as Netflix, Heroku, Pinterest and Instagram. At Bizo, a company that provides business marketing services for hundreds of F1000 clients, we serve billions of requests a day across tens of thousands of websites, and have our entire infrastructure on the AWS cloud, but didn’t have any downtime. The simple reason is that we take our customers’ uptime and site performance seriously, and have built tools and services on AWS to ensure high-availability (HA) and low-latency (LL) services. Despite the FUD created by many of the industry blogs and press, it is possible to create HA and LL services on AWS if you follow some simple steps.

Amazon Web Services Outages: 4 Steps for Survival

AWS Billing Info in Hive

2012-06-14T20:33:00.001-07:00

Amazon recently (finally!) launched programmatic access to your AWS billing data.

Once you turn it on, select a bucket, and grant access to the AWS system user, you'll get a .csv file with your estimated billing for the month. The files are delivered daily, but they contain month-to-date information, and will replace the file from the previous day.

It's easy enough to view this information in excel (or similar), but I thought it would be fun to take a look in hive, especially once we start having data for a few months to aggregate over.

Amazon delivers the data to the root of your bucket. I decided to start moving it to a hive-partitioned path, to make it easier to query once we start having more data. I wrote a simple scala script to move the data to [bucket]/partioned/year=[year]/month=[month]/[file]. Here's some example code.
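The path mapping itself can be sketched in a few lines of Scala. This is not the original script and it leaves out the actual S3 copy; BillingPartition and partitionedKey are hypothetical names, and the sketch assumes the billing file name ends in "-YYYY-MM.csv", which you should check against the files in your own bucket.

```scala
object BillingPartition {
  // Assumption: the file name ends in "-YYYY-MM.csv".
  private val Dated = """.*-(\d{4})-(\d{2})\.csv""".r

  // Builds the key under the bucket, following the layout described above
  // (the "partioned" prefix is spelled as in the path given in this post).
  def partitionedKey(fileName: String): Option[String] = fileName match {
    case Dated(year, month) => Some(s"partioned/year=$year/month=$month/$fileName")
    case _                  => None // not a dated billing file; leave it alone
  }
}
```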

Ok, now we're ready to read the data in Hive.

Here's a hive schema for the AWS billing information. It uses the csv-serde (make sure you add that jar before running the create table statement). Run alter table aws_billing recover partitions; to load in the partitions (one per year/month), and you're ready to query.

Like I said, it's overkill to use hive to read this data for a month or so, but it's just so addictive having a SQL interface to arbitrary S3 data :).

Here are some example queries to get you started.

Costs by Service

select ProductCode, UsageType, Operation, sum(TotalCost)
  from aws_billing
 where RecordType in ("PayerLineItem", "LinkedLineItem")
 group
    by ProductCode, UsageType, Operation
;

EC2 usage, by size (across EC2/EMR)

select ProductCode, UsageType,sum(TotalCost)
  from aws_billing
 where RecordType in ("PayerLineItem", "LinkedLineItem")
   and UsageType like "BoxUsage%"
 group
    by ProductCode, UsageType
;

the golden rule of programming style

2012-06-13T16:10:00.001-07:00

There's an interesting page on the subject of compilation units per file over at the scala style guide.

The guideline is delightfully vague, which I will paraphrase as: Mostly use single files, unless you can't, or unless it's better if you don't.

The author(s) go on to expand on the reasoning behind breaking the guideline:

Another case is when multiple classes logically form a single, cohesive group, sharing concepts to the point where maintenance is greatly served by containing them within a single file. These situations are harder to predict… Generally speaking, if it is easier to perform long-term maintenance and development on several units in a single file rather than spread across multiple, then such an organizational strategy should be preferred for these classes.

This touches on what I consider to be the golden rule of programming style: Make your intent clear and the code easy to read.

Software spends most of its life in maintenance, which is why we have style guides and coding standards. It's valuable to have consistent looking code to promote a shared vocabulary, improve readability, and steer away from confusing or error-prone constructs.

It is just as important to be able to understand, both as an author and as a reviewer, that in certain cases following the letter of the law goes against the main goal of improving readability and maintenance. A one-size-fits-all rule does not always work, and as the authors of this particular guideline mention, "these situations are harder to predict."

Make your intent clear and the code easy to read.

Scala Test Plug-in for Sublime Text 2

2012-04-20T17:19:00.000-07:00

I have documented and put some polish on the Sublime Text 2 plug-in I blogged about previously. It lets you run a single Scala Test, or all tests in your project. It also lets you quickly navigate to any scala files in your project folder, and switch back and forth between a class and its test. Check it out here: https://github.com/patgannon/sublimetext-scalatest

Dev Days: Hacking, Open Source and Docs

2012-04-20T16:21:00.002-07:00

Dev Days

Every month we have a "Dev Day" where engineers take a break from their projects and work on "other stuff". Most start-up engineering teams have a "Hack Day" where everyone gets to hack on anything they want as long as they ship and share it with the rest of the team. Of course we have Hack Days but we also have other types of Dev Days too. In fact, we have three types of Dev Days:

Hack Days
Open Source Days
Doc Days

Open Source Days

You know what Hack Days are so I'll move on quickly to Open Source Days. Just like most companies these days, Bizo uses a lot of open source software (OSS). We love OSS and the community of developers and companies that share it. Over the last few years, we've used plenty of OSS but we've also created and given back lots of code as well.

Actually, today is one of our Open Source Days, so all the engineers are working on both new and old open source projects. You can check out our (growing) list of projects by visiting code.bizo.com. Over the years, we've created a lot of tools around AWS including s3cp, fakesdb, and aws-tools (a package of all the CLI tools). We've also built a lot of stuff for Hadoop (Hive, etc.) including csv-serde and gdata-storagehandler, and our latest is a Scala query language called revolute (still in development). In addition, we have a wide variety of other awesome code including Joist, dependence.js, raphy-charts and other fun stuff!

Doc Days

The third type of Dev Day we have is called Doc Days. I know what you are thinking, but Doc Days are extremely valuable for engineering, and for everyone else for that matter. On Doc Day, the entire engineering team works on wiki pages, code documentation, design docs, architecture docs and even blog posts. It really is better than it sounds!

If you've read my post on "building a kick ass engineering team", you know that one of the keys is the 3Cs... Communication, Communication, Communication! (My high school baseball coach taught me that one.) As an engineering team, we believe that communication is one of the best things we can do for each other. As any company grows, communication becomes a larger and larger part of the day-to-day, and we see Doc Days as a way to ensure that we are communicating as clearly and accurately as we can.

Conclusion

These Dev Days have been a huge success for Bizo engineering. We've even inspired other departments to have similar days (Marketing in particular likes these documentation days!). We challenge you to go beyond the "Hack Day" and start thinking about other Dev Days that your engineering organization can benefit from.

Implementation driven interfaces?

2012-04-05T16:36:00.001-07:00

I've recently encountered some interesting pagination in the Google Groups admin interface. It starts off simple enough, nothing exciting here...

Instead of the usual 'Previous', we see 'First' on the next page.

Are they just being clever? Knowing that there's only one previous page? No such luck…

We've reached the end of the list. I hope you've found what you're looking for, otherwise start over from the beginning!

One has to wonder, who designed this interface? You can only go forward. If you overshoot, it's back to square one, then click, click, click… It's clear it does not have users in mind at all.

My guess is that it's based on some limitation in the backend storage or query mechanism. The system only allows forward navigation of query results, so the interface simply mirrors that…

What an incredibly frustrating experience! I'll never take simple pagination for granted again.

It's a good reminder to think about your users and how they will interact with the system. Mirroring the programming interface rarely works.

Capturing Client Side JS Errors on AWS

2012-04-04T08:56:00.006-07:00

I saw a post go by on Hacker News this morning discussing capturing and reporting on client side errors. We have been doing this for a long time and I wanted to share our approach.

Background
Quick background: we have two major types of JavaScript that our customers and partners may use: analytics tags and ad tags. Both tags are JavaScript and share the same error capture code.

Another quick note is that we run on Amazon Web Services so this approach is based on some of these services including S3, CloudFront and EMR.

Implementation
Our client side JS is compiled from CoffeeScript. I've created a couple of gists to show you what the error logging code looks like in CoffeeScript.

Details
The example shows our ad tags trying to execute inside a try/catch. The catch block captures the error and eventually loads an image, with the relevant error metadata appended to the image URL.

AWS Details
The image that is loaded actually lives on CloudFront. The CloudFront distribution is set up with logging, which means that requests are logged and delivered to a specified S3 bucket (usually within 24 hours). Every day we run an EMR job against the CloudFront request logs that generates a report summarizing the errors. And that is it. It's pretty simple, and this approach has worked for us.
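To make the reporting step concrete, here is a minimal local sketch in Python of the kind of summary such a job might produce. It is not our EMR job. It assumes the standard tab-separated CloudFront access log layout with a "#Fields:" header line, and the pixel path "/error.gif" and the query parameter "msg" are hypothetical names chosen for illustration.

```python
from collections import Counter
from urllib.parse import parse_qs


def summarize_errors(log_lines, pixel_path="/error.gif", param="msg"):
    """Count error messages reported through the logging pixel.

    pixel_path and param are hypothetical; use whatever your tags send.
    """
    fields = []
    counts = Counter()
    for line in log_lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # The header names the columns of every row that follows.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split("\t")))
        if row.get("cs-uri-stem") != pixel_path:
            continue  # not a request for the error pixel
        query = parse_qs(row.get("cs-uri-query", ""))
        for message in query.get(param, ["(unknown)"]):
            counts[message] += 1
    return counts
```

Calling summarize_errors on the lines of a downloaded log file returns a Counter, so counts.most_common(10) gives a rough "top errors" list, which is all the directional information needs.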

Pre-emptive "this isn't perfect" response
Some of you may be thinking, "you may not get all requests!". CloudFront logs are not supposed to be used for 100% accurate reporting (although nothing is really 100%). In our case, we don't need to capture all errors; rather, we are looking for directional information.

Creating Plug-ins for Sublime Text 2

2012-04-02T11:48:00.000-07:00

I have been trying out Sublime Text 2 as my text editor lately, and I'm loving the simplicity, so I figured I would try creating a plug-in for it. I was pleasantly surprised at how easy it is, which is an important step towards it becoming my new editor of choice. I wanted to take some steps towards creating something along the lines of rinari, but for Scala... in Sublime Text. I was able to fairly easily create a plug-in that lets me run the Scala Test currently open in the editor, run all Scala Tests in the (inferred) project folder, switch back and forth between a test and the code under test, and quickly navigate to any Scala file in the project folder with a few keystrokes. This post will show you how to create a new plug-in for Sublime Text 2, using all the API features that I needed to implement that functionality.

Create a new plug-in

Step 1. Install Sublime Text 2 (see link above). It's free to try, and fairly cheap to buy. A month or so after you download it, it basically becomes nag-ware until you finally manage to overcome your stingy developer impulses and plunk down the $59 to buy it. Also, unlike other similar text editors (ahem.. TextMate!) it actually runs on Windows and Linux, as well as Mac OSX.

Step 2. Create a new folder for your plug-in. On Mac OSX, this goes under your home folder in ~/Library/Application Support/Sublime Text 2/Packages/{PLUGIN_NAME} (where in my case, {PLUGIN_NAME} was "ScalaTest").

Step 3. Create a python file which will contain the code for the plug-in. (Name it whatever you want, as long as it ends in ".py" ;-) Here is a really basic plug-in (borrowed from this plug-in tutorial, which you should read after this):

import sublime, sublime_plugin

class ExampleCommand(sublime_plugin.TextCommand):
  def run(self, edit):
    self.view.insert(edit, 0, "Hello, World!")

Right, so, as I mentioned: Sublime Text 2 plug-ins are written in Python. Don't worry too much if you're not familiar with Python... I wasn't either prior to starting this experiment, and it didn't prove to be too much of a problem. (I did have a couple of Python books lying around, but I'm sure the same information is on the tubes.) It's fairly easy to pick up, and has some similarities to Ruby, in case that helps. So the code above creates a command called "example", which is defined by a class that inherits from Sublime Text's "TextCommand" class. (Sublime Text 2 maps the title-case class names to underscore-delimited command names, and strips the "Command" suffix.) All the plug-in does is insert the text "Hello, World!" at the beginning of the file open in the editor.
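To illustrate the naming rule, here is a small standalone Python sketch that approximates the mapping. It is not Sublime Text's own implementation, and the function name command_name is made up for this example; it only handles simple title-case names like the ones used in this post.

```python
import re


def command_name(class_name):
    """Approximate the command name derived from a command class name."""
    # Strip the "Command" suffix, if present.
    if class_name.endswith("Command"):
        class_name = class_name[:-len("Command")]
    # Put an underscore before every capital letter except the first,
    # then lower-case the whole thing.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()
```

With this, command_name("ExampleCommand") gives "example", which is the name passed to view.run_command in the next step.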

(Note: Sublime Text 2 will detect that you created a Python file under its plug-in folder and automatically loads it.)

Step 4. Run your example. Hit Ctrl+Backtick to open the python interpreter within Sublime Text 2. Run your command by typing in this:

view.run_command("example")

The open buffer will now include the aforementioned greeting. You could bind it to a key-combination easily enough, but hey, it doesn't do anything cool yet, right?, so we'll hold off on the key bindings until the end.

Make it do something cool

So that you can see these approaches in action, I uploaded my nascent ScalaTest plug-in to github: https://github.com/patgannon/sublimetext-scalatest. Note that this plug-in will currently only work with projects that use Bizo's standard folder structure, and has a hard coded path to the scala executable, so it's not ready to be used as-is. I hope to clean it up in the future and make it more generically applicable, but for now, I've only shared it to add a bit more color to the code snippets in this section.

Run a command on the current file

The name of the file currently open in the editor can be obtained with this expression: self.view.file_name(). In my plug-in, I use that to infer a class name, the project root folder, and path to the associated test (using simple string operations).
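As a rough illustration of those string operations, here is a standalone Python sketch. The folder layout ("src/main/scala" and "src/test/scala"), the "Test" suffix, and the function name describe are assumptions made for this example; they are not necessarily what the plug-in or Bizo's folder structure uses.

```python
MAIN_DIR = "/src/main/scala/"  # assumed layout, for illustration only
TEST_DIR = "/src/test/scala/"  # assumed layout, for illustration only
TEST_SUFFIX = "Test"           # assumed naming convention for tests


def describe(file_name):
    """Return (project_root, class_name, counterpart_path), or None.

    counterpart_path is the test for a source file, or the source file
    for a test.
    """
    is_test = TEST_DIR in file_name
    marker = TEST_DIR if is_test else MAIN_DIR
    if marker not in file_name or not file_name.endswith(".scala"):
        return None
    project_root, relative = file_name.split(marker, 1)
    base = relative[:-len(".scala")]
    class_name = base.replace("/", ".")
    if is_test:
        other = base[:-len(TEST_SUFFIX)] if base.endswith(TEST_SUFFIX) else base
        counterpart = project_root + MAIN_DIR + other + ".scala"
    else:
        counterpart = project_root + TEST_DIR + base + TEST_SUFFIX + ".scala"
    return project_root, class_name, counterpart
```

Inside a TextCommand you would pass self.view.file_name() to a helper like this and bail out when it returns None.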

You can create an output panel (in which to render the results of running a command on the open file) by calling: self.window.run_command("show_panel", {"panel": "output.tests"}) (where "output.tests" is specific to your plug-in). In my plug-in, I created the helper methods below to show the panel and clear out its contents. (See the BaseScalaTestCommand class in run_scala_test.py). Note that this code was derived from code I found in the Sublime Text 2 Ruby Tests plug-in.

  def show_tests_panel(self):
    # Create the output panel once and reuse it on later runs.
    if not hasattr(self, 'output_view'):
      self.output_view = self.window().get_output_panel("tests")
    self.clear_test_view()
    self.window().run_command("show_panel", {"panel": "output.tests"})

  def clear_test_view(self):
    # The panel is read-only between updates, so unlock it while erasing.
    self.output_view.set_read_only(False)
    edit = self.output_view.begin_edit()
    self.output_view.erase(edit, sublime.Region(0, self.output_view.size()))
    self.output_view.end_edit(edit)
    self.output_view.set_read_only(True)

(Note: I don't recommend copy/pasting code directly from this blog post, because the examples are pasted in from github, which messes up the indentation, which is a real problem in Python; instead, clone the github repository and copy/paste from the real file on your machine.)

To actually execute the command, I use this code in my run method, after calling show_tests_panel defined above (note that you will need to import 'subprocess' and 'thread' at the top of your plug-in file, plus 'os' and 'functools' for the read_stdout method shown further down):

    self.proc = subprocess.Popen("{my command}", shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    thread.start_new_thread(self.read_stdout, ())

...where {my command} is the shell command I want to execute, and read_stdout is a method I defined which copies the output from the process and puts it into the output panel. It's defined as follows (and calls the append_data method, also defined below):

  def read_stdout(self):
    # Runs on a background thread: read chunks until the process closes stdout.
    while True:
      data = os.read(self.proc.stdout.fileno(), 2**15)
      if data != "":
        # Hand the chunk to the main thread, which owns the output panel.
        sublime.set_timeout(functools.partial(self.append_data, self.proc, data), 0)
      else:
        self.proc.stdout.close()
        break

  def append_data(self, proc, data):
    # Append the chunk to the end of the (normally read-only) output panel.
    self.output_view.set_read_only(False)
    edit = self.output_view.begin_edit()
    self.output_view.insert(edit, self.output_view.size(), data)
    self.output_view.end_edit(edit)
    self.output_view.set_read_only(True)

(Note: Depending on the command you're running, you may also want to capture the process' stderr output, and also put that into the output panel, using a variation of the approach above.)
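One possible variation, sketched here as standalone Python rather than as plug-in code, is to merge stderr into stdout when starting the process so that a single reader loop sees both streams. The function name stream_output and its on_data callback are made up for this example, and it uses the 'threading' module instead of 'thread'. In a plug-in, on_data would wrap the append_data call in sublime.set_timeout as shown above.

```python
import os
import subprocess
import threading


def stream_output(command, on_data):
    """Run a shell command and pass each chunk of stdout+stderr to on_data."""
    proc = subprocess.Popen(command, shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)  # merge stderr into stdout

    def pump():
        # Same loop as read_stdout: read until the pipe reports end-of-file.
        while True:
            data = os.read(proc.stdout.fileno(), 2 ** 15)
            if not data:
                proc.stdout.close()
                break
            on_data(data.decode("utf-8", "replace"))

    reader = threading.Thread(target=pump)
    reader.start()
    return proc, reader
```

The trade-off is that the two streams are interleaved, so you can no longer tell which lines came from stderr. If that matters, start a second reader thread on proc.stderr instead.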

Using the "quick panel" to search for files, and opening files

The "quick panel" (the drop-down which lists files when you hit command-T in Sublime Text) can be extended to have plug-in specific functionality, which I used to create a hot-key for quickly navigating to any Scala file under my project folder. (See the JumpToScalaFile class in run_scala_test.py.) One of the plug-in examples I saw using the quick panel sub-classed sublime_plugin.WindowCommand instead of TextCommand. This results in a plug-in which can be run without any files being open. The flip side of that, though, is you don't get the file name of the currently open file, which in my case is required to infer the base project folder in which to search for files. Thus, all my plug-ins sub-class TextCommand. To open the quick panel, execute: sublime.active_window().show_quick_panel(file_names, self.file_selected). file_names should be a collection of the (string) entries to show in the quick panel. Note that the entries don't have to be file paths, just a convenient identifier to show the user (in my case, the class name). file_selected is a method you will define, which will be called when a user selects an entry in the quick panel. Here's how I defined it:

  def file_selected(self, selected_index):
    # The quick panel passes -1 when the user cancels without choosing.
    if selected_index != -1:
      sublime.active_window().open_file(self.files[selected_index])

self.files is an array I created when populating the quick panel which maps an index in the quick panel to a file path. I then use sublime.active_window().open_file to open that file in Sublime Text.
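For illustration, here is one way the two parallel lists could be built, as a standalone Python sketch. The function name find_scala_files is made up for this example, and the plug-in on github may populate its lists differently.

```python
import os


def find_scala_files(project_root):
    """Return (names, paths): quick panel entries and the matching file paths."""
    names, paths = [], []
    for folder, _dirs, files in os.walk(project_root):
        for file_name in sorted(files):
            if file_name.endswith(".scala"):
                # Show just the class name; keep the full path for open_file.
                names.append(file_name[:-len(".scala")])
                paths.append(os.path.join(folder, file_name))
    return names, paths
```

Because both lists are appended to in lockstep, the index the quick panel hands to file_selected can be used directly to look up the path.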

I also used that same method (open_file) in the plug-in that automatically navigates back and forth between a test file and the code under test. That plug-in also makes use of the sublime.error_message method, which will display an error message to the user (if no test is found, for example).

Create keystroke bindings

To bind your new plug-in commands to keystrokes, create a file in your plug-in folder called Default (OSX).sublime-keymap. This will contain the keystrokes that will be used on Mac OSX. (You would create separate files for use on Windows and Linux.) It is a simple JSON file that maps keystrokes to commands. Let's see an example:

[
  { "keys": ["super+shift+e"], "command": "jump_to_scala_file" }
]

This example will bind Command+Shift+e to the "jump_to_scala_file" command (defined in the JumpToScalaFileCommand class in any plug-in). If you have multiple key-mappings, you would create multiple comma-delimited entries within the JSON array. (See the example in my plug-in.) In order to reduce the possibility of defining keystrokes that collide with keystrokes from other plug-ins, I defined mine in such a way that they're only available when the currently open file is a Scala file. Here is the rather verbose (ahem, powerful) syntax that I used to do that:

[
  { "keys": ["super+shift+e"], "command": "jump_to_scala_file",
    "context": [{"key": "selector", "operator": "equal", "operand": "source.scala", "match_all": true}] }
]

Conclusion

Over the years, I've grown to prefer light-weight editors (such as emacs or Sublime Text 2) over more heavy-weight IDEs (such as Eclipse or Visual Studio) because they don't tend to lock up in the middle of writing code and/or crash sporadically, and I generally don't need a lot of whiz-bang features when I'm coding these days. I used emacs (and rinari) for doing rails development for a year or so, but the basic key-strokes (compared to the de-facto text editing standard key-strokes) and the undo/redo functionality always seemed a bit awkward, especially when you wind up switching back and forth between that and other text editors. Also, the language for creating emacs plug-ins is Emacs Lisp (a dialect of Lisp), which to me isn't very convenient for this sort of thing.

I was really pleased with my foray into creating plug-ins for Sublime Text 2, and combined with its general ease of use, I've decided it's now my new favorite editor. Using an editor that's this easy to significantly customize seems like it could be a real productivity win over time. Given the fairly rich list of plug-ins already available, I think the future is bright for Sublime Text 2. Below is a list of resources I found helpful during this process, including said list of plug-ins.

Resources

Unofficial documentation: http://sublimetext.info/docs/en/index.html especially http://sublimetext.info/docs/en/extensibility/plugins.html
Official plug-in examples (sparse): http://www.sublimetext.com/docs/plugin-examples
Official API reference: http://www.sublimetext.com/docs/2/api_reference.html

Helpful examples:
https://github.com/maltize/sublime-text-2-ruby-tests
https://github.com/noklesta/SublimeRailsNav
https://github.com/luqman/SublimeText2RailsRelatedFiles
https://github.com/rspec/rspec-tmbundle

Unofficial list of plug-ins:
http://wbond.net/sublime_packages/community

A Short Script for Logging into Interactive Elastic MapReduce Clusters

2012-03-13T17:24:00.001-07:00

Elastic MapReduce is great, but the latencies can be painful. For me, this is especially true when I'm in the early stages of developing a new job and need to make the transition from code on my local machine to code running in the cloud -- the ~5 minute period between starting up a cluster and actually being able to log on to it is too long to sit there staring at a blank screen and too short to effectively context switch to something else in a useful way.

My current solution is to allow myself to get distracted but to drag myself back to my EMR session as soon as it's available. Adding some simple polling plus a sticky growl notification to my interactive-emr-startup script does the trick quite nicely:

#!/bin/bash

if [ -z "$1" ]; then
  echo "Please specify a job name"
  exit 1
fi

# Temporary file that captures the output of the create call
TMP_FILE=$(mktemp)

elastic-mapreduce \
(... with all of my favorite options ...) \
| tee "${TMP_FILE}"

# The job ID is the fourth field of the create output
JOB_ID=`awk '{print $4}' "${TMP_FILE}"`
rm "${TMP_FILE}"

# poll for WAITING state
JOB_STATE=''
MASTER_HOSTNAME=''
while [ "${JOB_STATE}" != "WAITING" ]; do
  sleep 1
  echo -n .
  RESULT=`elastic-mapreduce --list | grep ${JOB_ID}`
  JOB_STATE=`echo $RESULT | awk '{print $2}'`
  MASTER_HOSTNAME=`echo $RESULT | awk '{print $3}'`
done
echo Connecting to ${MASTER_HOSTNAME}...

growlnotify -n "EMR Interactive" -s -m "SSHing into ${MASTER_HOSTNAME}"

ssh $MASTER_HOSTNAME -i ~/.ssh/emr-keypair -l hadoop -L 9100:localhost:9100
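The polling loop in the script never gives up, which is fine for an interactive session. If you wanted the same idea with a timeout, a minimal Python sketch could look like the one below. The function wait_for_state and its parameters are made up for this example, and get_state stands in for whatever you use to look up the job state (for instance, parsing the output of elastic-mapreduce --list).

```python
import time


def wait_for_state(get_state, target="WAITING", interval=1.0,
                   timeout=600.0, sleep=time.sleep):
    """Poll get_state() until it returns target.

    Returns True once the target state is seen, or False if roughly
    `timeout` seconds of sleeping pass first.
    """
    waited = 0.0
    while waited <= timeout:
        if get_state() == target:
            return True
        sleep(interval)
        waited += interval
    return False
```

Passing the sleep function in as a parameter is only there to make the helper easy to try out without actually waiting.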

One of my personal productivity goals for the year is finding little places like this that I can optimize with a short script. This particular one has rescued me from the clutches of HN more than once!

On Code Reviews and Developer Feedback

2012-03-13T10:53:00.001-07:00

There's a great post from last week at 37signals, Give it five minutes:

While he was making his points on stage, I was taking an inventory of the things I didn’t agree with. And when presented with an opportunity to speak with him, I quickly pushed back at some of his ideas. I must have seemed like such an asshole.

His response changed my life. It was a simple thing. He said “Man, give it five minutes.” I asked him what he meant by that? He said, it’s fine to disagree, it’s fine to push back, it’s great to have strong opinions and beliefs, but give my ideas some time to set in before you’re sure you want to argue against them. “Five minutes” represented “think”, not react. He was totally right. I came into the discussion looking to prove something, not learn something.

There’s also a difference between asking questions and pushing back. Pushing back means you already think you know. Asking questions means you want to know. Ask more questions.

This is such a great outlook and a great way to approach the discussion of feedback for code reviews and design reviews.

It's surprising how little time development teams devote to training, or even internal discussion on effective feedback. As developers, we are constantly engaged in this kind of communication: white-boarding sessions, spec reviews, design reviews, code reviews. We're expected to give and receive feedback on a daily basis, but few of us are properly prepared for it. Not only do we lack the training, but we have many negative examples to draw from. Who hasn't been a part of a design review where tempers flare? Properly giving feedback is something that requires constant attention and practice. Receiving feedback can be just as difficult.

Culture of Communication

One of the major pillars of our engineering culture at Bizo is "the 3 Cs": Communication, Communication, Communication.

We've tried hard to build a team of engineers that are eager to receive feedback, humble about their abilities, objective and gracious with their feedback, and freely giving of their own knowledge and experience. We see communication as a prerequisite for building a world-class team and developing high-quality code. You often hear the phrase "strong opinions, weakly held," and that is the kind of culture we have tried to build.

Communication is hard. It takes real team agreement and commitment to continued work to keep this culture alive and well. It's important the team views effective communication as important and that the culture supports it.

Code Reviews

Code reviews are something that can easily be approached from the wrong perspective, both as an author and as a reviewer.

As a reviewer, it can be easy to jump in and argue, to try and push 'your' solution (even though it may be equivalent), to push back instead of asking questions and trying to understand.

As an author, it's far too easy to get attached to your code, to your specific solution/naming/etc. It's also easy to feel like each comment is an attack on your ability, and that by accepting the feedback, this somehow means that you were wrong or did a bad job. Of course, nothing could be further from the truth!

At Bizo, we perform code reviews for every change. They are a major part of our culture of communication. In order to perform effective code reviews, it's important to have some shared guidelines that help support effective communication.

Here are some guidelines we've found to be helpful for performing code reviews:

What is a code review

A careful line-by-line critique of code by peers
Happens in a non-threatening context
The goal is cooperation and mutual learning, not fault finding

Code reviews are a team exercise to improve understanding and make the code better!

When people think of code reviews they usually think of catching bugs. Code reviews do occasionally catch bugs or potential performance problems, but this is rare.

Just as important is fostering a shared understanding of the code and exposure to new approaches, techniques, and patterns. Seeing how your peers program is a great way to learn from them.

Enforcing coding standards and style guides is another way code reviews help. When working on a team, it's important to keep readability and quality high by using a shared vocabulary.

As an author

As an author, it's important to view each comment as a new opportunity to improve your code. Instead of jumping into defense mode, take a step back and think. Try to approach the code again for the first time with this new perspective. Your team has a lot of experience and varied backgrounds -- draw from them! They are there to help you. Use the gift of their experience and knowledge to improve the code.

Trust the team, and view all comments as action items. Some changes can seem arbitrary, especially when it comes to naming and organization. Unless there's a strong reason, tend to agree with your reviewers. If a reviewer finds something confusing, it is confusing! Code spends most of its life in maintenance and programming is a team sport. Remember that they are your audience, and you want them to be able to understand your code at 4am after a system crash.

As a reviewer

As a reviewer, it's important to take the time to read the code, think, and ask questions until you understand it before providing feedback. The author has probably spent a lot more time thinking about the problem and the approach over the course of the project.

Be strict on coding standard and style guide violations. The real cost of software is maintenance (80% according to Sun). It's important the code is easily understood by the team.

Be gentle on personal preferences. If it's not a standard violation and just a matter of personal preference, defer to the author. It's okay to present your perspective, but mention that it's just a preference and not meant to be taken as an action item.

Trust the author. It's often the case that there are many valid approaches to a problem. It's great to present alternative approaches and discuss pros/cons of various approaches. If you see alternative solutions, bring them up! When discussing alternatives, make sure to listen to the author. Remember they are the subject matter expert and you are working together on the same team.

It takes work!

Communication is hard! It's easy to screw up. It's easy to go into attack or defense mode when you're passionate about what you're doing. It's really something we all need to remind ourselves to work on every day. It's something we need to periodically remind ourselves of as a team. Try to view each review as an opportunity to practice these guidelines. Just remember to take a step back, think, and ask questions.

Fault Tolerant MongoDB on EC2

2012-03-12T14:03:00.004-07:00

While working on a project at Bizo, I needed to connect a Rails app to a MongoDB backend, both of which run in Amazon's cloud (EC2). At Bizo we have a policy of avoiding non-Amazon services when possible (to limit risk), so we normally run most of our services straight off of EC2. I'd like to share the best practices I learned along the way, in the hope that they save others some time and frustration.

Primer

Replica sets are the preferred way to run a distributed, fault tolerant MongoDB service. But as with any distributed system, nodes will eventually fail. Replica sets are pretty good at handling failures, but they can't save you if too many nodes fail.
Specifically, a replica set needs a majority of its members up in order to keep a primary, so a three-node set requires at least two working nodes at all times (1 primary and 1 secondary node). Thus a good rule of thumb is to run **at least 3 nodes** in a replica set; that way, if a node fails, your database service doesn't go down with it. The Rails app I was working with doesn't experience enormous amounts of traffic, so 3 m1.large (64bit) nodes were sufficient for my needs. What follows is a rundown of our setup and how it handles common needs of fault tolerant systems.
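The arithmetic behind that rule of thumb is simple enough to write down. This is just a worked example in Python, with function names made up for illustration; it assumes every member of the set gets a vote.

```python
def majority(members):
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be down while a majority is still up."""
    return members - majority(members)
```

A two-node set has a majority of two, so it tolerates no failures at all, while a three-node set has a majority of two and can lose one node. Going from three nodes to four does not buy any extra tolerance, which is one reason odd sizes are the usual choice.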

Best Practices

Minimize Failure with AutoScaling, Availability Zones, CloudWatch and EBS Volumes

Use Autoscaling Groups, CloudWatch and EBS Volumes to replace failed nodes as soon as they go down. Since we run three nodes, our replica set is insulated from failure due to a single node crashing. But if two nodes crash, the replica set goes with them. To solve this, we use CloudWatch alarms to trigger the Autoscaling Group whenever a node goes down. That way a replacement node is automatically brought online within a few minutes of a failure, which reduces the risk of nodes failing one after another. Additionally, each node stores its data on an EBS Volume (a network-attachable hard drive), so when a node fails, its replacement doesn't start up with missing data: it simply mounts the previous node's EBS Volume.
To protect against multiple nodes failing simultaneously, run each node in a separate availability zone. The above isn't sufficient to protect against things like hardware failures, as all 3 instances could wind up on the same hardware. Running each node in a separate availability zone guarantees that our mongo instances run with a reasonable amount of separation (e.g. they don't all end up on the same hardware box). Ideally you'd run each node in its own region (separate data center), but this causes headaches when configuring firewalls, as Amazon does not allow security groups to be used across multiple regions (see Security below). So unless you want to set up a VPN for cross-region communication, you're probably better off just running in separate availability zones.
Assuming you've created the group and are running three nodes, each in a separate availability zone, you can configure the auto scaling group using Amazon's command line tools like so:
```
as-update-auto-scaling-group my-mongo-service \
--region us-east-1 \
--availability-zones us-east-1a us-east-1b us-east-1c \
--max-size 3 \
--min-size 3 \
--desired-capacity 3
```
Now if any node fails, a new one will start up to take its place in the proper availability zone.

If Everything Fails, have backups

It's always good to have backups just in case something really bad happens. Fortunately since we use EBS Volumes this is really easy - we create nightly snapshots of the primary node's EBS Volume on our cron server using Amazon's command line tools (ec2-create-snapshot). These snapshots are persisted to S3 and we can easily restore our replica set from these backups.

Use Elastic IPs

As nodes fail and are replaced, you want both your replica set and your database clients to be able to find and connect to the new nodes. The easiest way to do this in Amazon is to use Elastic IPs: special static IP addresses that can be assigned to individual instances. Since each instance runs in a separate availability zone, we just need one Elastic IP per zone. When a new node starts up to replace a failed instance, it checks which zone it was started in and assigns itself the matching Elastic IP. Both the client and replica set configuration should point at the Elastic IP, so that failures and startups of new nodes are seamless to your app. Use the Elastic IP's DNS name rather than the raw address in these configurations (see Security below): cross-security-group openings in the firewall need to use internal (not external) addresses, and the DNS URL shown in the console resolves to an internal IP address when looked up from an EC2 instance.
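The zone lookup at startup can be sketched in a few lines of Python. This is an illustration rather than our actual startup script: the ELASTIC_IPS table and its addresses are placeholders, the function names are made up, and the final step of associating the address with the instance is left to whichever AWS tool or SDK you use. The metadata URL is the standard EC2 instance metadata endpoint; instances that require session tokens for metadata access would need an extra step that is not shown here.

```python
from urllib.request import urlopen

# Standard EC2 instance metadata path for the availability zone.
ZONE_URL = "http://169.254.169.254/latest/meta-data/placement/availability-zone"

# Hypothetical zone-to-address table; the addresses are placeholders.
ELASTIC_IPS = {
    "us-east-1a": "203.0.113.10",
    "us-east-1b": "203.0.113.11",
    "us-east-1c": "203.0.113.12",
}


def current_zone():
    """Ask the instance metadata service which zone this node runs in."""
    return urlopen(ZONE_URL, timeout=2).read().decode("ascii").strip()


def elastic_ip_for(zone, table=ELASTIC_IPS):
    """Pick the Elastic IP reserved for the given availability zone."""
    if zone not in table:
        raise LookupError("no Elastic IP reserved for zone %s" % zone)
    return table[zone]
```

On a freshly started node, elastic_ip_for(current_zone()) yields the address the node should claim before mongod is started.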

Security

This is where the headaches can start. Ideally you want to restrict access to your MongoDB instances to just your client application using Amazon's security groups. The way we normally set this up is to give the mongo instances a security group, say mongo-db-prod, and the client app a security group, say cool-app-prod. Then mongo-db-prod grants access on port 27017 (the default MongoDB port) to the security group cool-app-prod. Unfortunately, what's not documented very well is that if you use the **external Elastic IP** addresses in your configuration it **will not work** with security groups! Instead you have to use the Elastic IP's DNS URL (found in Amazon's web console) for security groups to work properly.

A Final Caveat

One thing to be careful of: if you require more than 5 nodes in a replica set, you'll run into a problem using Elastic IPs, because Amazon by default only allows 5 EIPs per region. You'll need to either ask Amazon to increase this limit on your account or seek out an alternative setup.
Well, that's it, but if you have another setup for running MongoDB on EC2 I'd love to hear it. Until next time.

Building a Product in Just 8 Hours

2012-02-17T09:21:00.000-08:00

Recently at Bizo, we decided to try a new kind of hack day. Previously during hackdays our engineers worked individually on their own project(s). But on our last hack day we decided to try something new – The 8 Hour Product Challenge.
We would build and launch a completely new product in the course of a normal workday (9-5pm). “Launching” meant this product had to be running publicly on the internet by 5pm – no excuses, no “Wait! I need 5 more minutes” – whatever was there had to be deployed. In short the experience was fantastic and I can’t wait to do it again. Here’s a breakdown of the experience:

Initial Meeting 9:30am

Organizing developers for a meeting of any kind is like trying to herd cats. But if it’s a meeting before 11:00am you’re not herding regular cats, you’re herding sleepy, fat cats with one leg and half an ear. Somehow, after a lot of cattle prodding by our VP of engineering, our team eventually managed to shuffle its way into the conference room like the decaffeinated zombies we were and get to work.
We decided to build a stealth product. The system would use Bizo’s rich business data to personalize special content for visitors based on things like their industry, company size and seniority. The goal for the end of the day was to have a small webapp up and running on Amazon’s servers.
After a bit more discussion, we decided to split tasks up into five groups of two engineers:

Data Discovery Team – Find relevant items for users by using our B2B business data & network.
Scraping Team – Given a URL representing an item, scrape the page contents and store them for later use.
Data Classification Team – Extract relevant data from the HTML source of the previously scraped URLs.
Backend Team – Backend architecture for the webapp that fetches and serves the content generated in the previous steps.
Frontend Team – Frontend design + JavaScript that makes the app functional.

Engineers were assigned more or less randomly, with the exception of myself – I was assigned to the frontend team directly. During the course of our initial planning meeting, we (myself included) often found ourselves becoming sidetracked with feature bloat, premature scalability concerns and a myriad of other things not essential to our MVP. Fortunately one of my coworkers (Stephen) was smart enough to enforce timeboxing the meeting to one hour – eventually we got things back on topic and designed the critical components before time ran out.

Start Work 10:30am

Once the teams were assigned, we all jumped in and started to work on our relevant tasks. My partner, Darren, and I immediately started out by sketching ideas on paper for our design. I can’t stress enough how important sketching is for being able to rapidly prototype a product – a trick I picked up back when I interned over at ZURB. Only after we had some solid sketches did we move into Photoshop mockups. Meanwhile the other teams were all furiously programming their parts of the application:

Data Discovery Team – Worked out a simple ranking algorithm for data and started writing the Hive script to extract it.
URL Scraping Team – Started out in Scala, hacking up a script to scrape & download URL content.
Data Extraction Team – Decided to try out the Pismo gem to extract summaries and titles from scraped HTML content.
Backend Team – Was working on getting a sweet Scalatra webapp up and running.

Lunch Time & Status Updates 12:30pm

By lunchtime everything seemed to be coming along nicely. On the frontend we had completed our Photoshop mockup and had just begun writing some basic CSS styles. All the other teams reported making good progress on their tasks, with no major snags in the foreseeable future (betcha you never heard that one before…).

Afternoon 1:30pm

My frontend partner and I powered through our post-lunch food coma and were able to begin wiring up the UI using CoffeeScript in conjunction with Dependence.js. My teammate and I decided to give pair programming a shot. He has always been more of an Emacs kind of guy, while I prefer vi, but in the interests of learning I decided to try Emacs for the rest of the day – I now know why he’s always so worried about contracting carpal tunnel syndrome :).
Using some sample data generated by the first three teams we were able to get a rough UI working pretty quickly. Our side of things turned out to be pretty straightforward and involved three AJAX requests: one to retrieve a list of items grouped by segment from the Scalatra web server, one to get a list of the current targetable segments from the Bizo API, and another call to the API to retrieve a visitor’s business segments (bizographics).
The only snag we hit was a race condition – originally we attempted to execute all three requests simultaneously. In reality we had to wait to get the list of segments before getting the visitor’s profile. Darren and I just looked at each other and shrugged, then we indented the third API request a few spaces in our CoffeeScript code – race condition solved! Yes, that’s correct: we fixed a race condition by indenting some code. Don’t judge – it was a hack day.
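For anyone curious what that ordering looks like in code, here is a rough sketch in Python with asyncio rather than the CoffeeScript we actually wrote. The two independent requests still run together, and the visitor lookup only starts once the segment list is in hand. It assumes the indentation worked by moving the third request inside the callback of the segments request. All three fetch functions are hypothetical stand-ins that return canned data instead of calling the real services.

```python
import asyncio


async def fetch_items_by_segment():
    # Stand-in for the request to the Scalatra server.
    await asyncio.sleep(0.01)
    return {"software": ["item-1", "item-2"], "finance": ["item-3"]}


async def fetch_targetable_segments():
    # Stand-in for the first API call.
    await asyncio.sleep(0.01)
    return ["software", "finance"]


async def fetch_visitor_segments(targetable):
    # Stand-in for the second API call; it needs the targetable list.
    await asyncio.sleep(0.01)
    visitor = ["software", "small_business"]
    return [s for s in visitor if s in targetable]


async def load_page():
    # These two requests do not depend on each other, so run them together.
    items, targetable = await asyncio.gather(
        fetch_items_by_segment(), fetch_targetable_segments()
    )
    # This one depends on the segment list, so it starts only afterwards.
    visitor = await fetch_visitor_segments(targetable)
    return [item for segment in visitor for item in items.get(segment, [])]


personalized = asyncio.run(load_page())
```

Making the dependency explicit like this costs one extra round trip of latency, which was a fine trade for us on a hack day.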
The other teams all seemed to be doing well. The scraping team discovered the Scala collections’ magical par method, which turns normal data structures into parallel ones; they almost peed their pants with joy. At this point the backend team had completed the Scalatra app and was working on setting up our eventual deployment to EC2 using our custom infrastructure, cowboy.
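The scraping script itself was Scala, so this is not the team's code. As a rough analog of what a parallel collection buys you, here is a Python sketch that maps a function over a list of URLs with a thread pool. The scrape function is a hypothetical stand-in that does no network I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape(url: str) -> str:
    # Stand-in for downloading a page; a real version would do network I/O.
    return f"<html>{url}</html>"


def scrape_all(urls, max_workers=8):
    """Apply scrape to every URL using a pool of worker threads.

    Executor.map returns results in input order, much like mapping over
    a parallel collection and collecting the results.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape, urls))


pages = scrape_all(["http://example.com/a", "http://example.com/b"])
```

Threads suit this job because scraping spends most of its time waiting on the network rather than on the CPU.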

The Home Stretch & Deployment 4:30pm

Right around 4:30 we ran into a major problem. There had been a miscommunication regarding the necessary format of the JSON file needed by the frontend, and our data was coming through to the app in a format that just wouldn’t work. We had to scramble and hash things out with the other teams quickly before we hit our 5pm deadline. Thanks to a major push by the Data Extraction Team we were finally able to get everything in place. The product worked! – it wasn’t the most polished app ever, but what we had accomplished in just one day was pretty amazing. The product was deployed on EC2 and presented internally within Bizo – it was met with a lot of excitement and Bizo will probably be releasing it publicly in the weeks to come.
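A small check agreed on in the morning could have surfaced that mismatch hours earlier. Below is a minimal Python sketch of such a check. The payload shape (an object keyed by segment, each holding a list of items) and the field names title, url and summary are assumptions for illustration, not the format we actually used.

```python
import json

# Hypothetical item fields; the real format was whatever the teams agreed on.
REQUIRED_ITEM_FIELDS = {"title", "url", "summary"}


def validate_items_payload(raw: str) -> dict:
    """Parse the payload and fail loudly if it does not match the agreed shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top level must be an object keyed by segment name")
    for segment, items in data.items():
        if not isinstance(items, list):
            raise ValueError(f"segment {segment!r} must map to a list of items")
        for item in items:
            if not isinstance(item, dict):
                raise ValueError(f"items in {segment!r} must be objects")
            missing = REQUIRED_ITEM_FIELDS - set(item)
            if missing:
                raise ValueError(f"item in {segment!r} is missing {sorted(missing)}")
    return data


sample = '{"software": [{"title": "t", "url": "http://example.com", "summary": "s"}]}'
checked = validate_items_payload(sample)
```

Each team could run the same function against its own output, so a format disagreement shows up as a failing check instead of a 4:30 surprise.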

Closing Thoughts

Overall the experiment turned out to be a smash hit. Looking back on the experience, there are a bunch of things we could have done better. In retrospect, we got sidetracked a bit too much on nonessential feature ideas when we really should have been spending time clarifying the format of the data as it passed through each team’s project – this was something that came back and bit us near the end of the day. But mishaps aside, it’s really amazing what you can accomplish with 9 other talented people in a single day.
I can’t recommend trying this with your team enough – what do you think, is your engineering team up for The 8 Hour Product Challenge? If you’re interested in the technologies we used throughout the day, see below.

Appendix, Technologies Used

Data Discovery Team – Hive, Amazon EC2, Amazon S3
URL Scraping Team – Scala
Data Extraction Team – JRuby, gems of note: pismo, right_aws (for Amazon S3)
Backend Team – Scalatra, Amazon S3
Frontend Team – Photoshop, HTML5, Sass, CoffeeScript, jQuery, dependence.js

work at Bizo (looking for some good engineers)

2012-01-30T19:38:00.001-08:00

We’re a small, disciplined team that gets a lot done. Our platform processes billions of page views monthly and 100s of terabytes of data so we have lots of fun problems to tackle. We believe in teamwork and communication: comments, design reviews, code reviews for every change, weekly tech talks. We believe in giving developers ownership over projects. We believe Engineering is more than coding. We have fun and keep the beer fridge well stocked.

We have customers, are well funded and were recently named the fourth-fastest growing private company in the San Francisco Bay Area.

We are looking for motivated problem solvers with an entrepreneurial / hacker spirit.

If you're a reader of this blog, you already know our technology stack. Some highlights: Scala, Java, JavaScript, Ruby, AWS (pretty much every service), Hadoop/Hive, GWT, MongoDB, Solr, etc.

If you're interested, please apply on Stack Overflow.

Configuration	Connection timeout	Socket timeout	MaxTotal	MaxPerRoute
SameRegion	125 ms	125 ms	200	100
SameRegionWithUSEastFailover	1 second	1 second	200	100
CrossRegion	10 seconds	10 seconds	200	100
MaxTimeout	1 minute	5 minutes	200	100