Wednesday, September 26, 2012

Grouping pageviews into visits: a Scala code kata

The basic units of any website traffic analysis are pageviews, visits, and unique visitors.  Tracking pageviews is simply a matter of counting requests to the server.  Calculating unique visitors usually relies on cookies and unique identifiers.  Visits, however, require a bit more work.  For our purposes, a single visit is defined as a sequence of pageviews where the interval between pageviews is less than a fixed length like 15 minutes.

I thought that the problem of grouping pageviews into visits would make an interesting code kata.  Here’s the statement of the problem that I worked from:

Given a non-empty sequence of timestamps (as milliseconds since the epoch), write a function that would return a sequence of visits, where each visit is itself a sequence of timestamps where each pair of consecutive timestamps is no more than N milliseconds apart.

As a starting point, I decided to take a straightforward procedural approach:

import scala.collection.mutable.ListBuffer

// N is the maximum allowed gap between pageviews (in milliseconds) from the problem statement
def doingItIteratively(pageviews: Seq[Long]): Seq[Seq[Long]] = {
  val iterator = pageviews.sorted.iterator
  val visits = ListBuffer[ListBuffer[Long]]()

  var previousPV: Long = iterator.next
  var currentVisit: ListBuffer[Long] = ListBuffer(previousPV)

  for (currentPV <- iterator) {
    if (currentPV - previousPV > N) {
      visits += currentVisit
      currentVisit = ListBuffer[Long]()
    }

    currentVisit += currentPV
    previousPV = currentPV
  }
  visits += currentVisit

  visits map (_.toSeq) toSeq
}

So, we simply iterate through the (sorted) events tracking both the current visit and the previous pageview.  If the current pageview represents a new visit, push the previous visit into the list of all visits and start a new one.  Then push the current pageview into the (potentially new) visit.
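As a quick sanity check, here's a small made-up example (the timestamps are arbitrary epoch millis from September 2012, and N is set to 15 minutes); any of the implementations in this post should produce the same grouping:

// hypothetical sample data, just to pin down the expected behavior
val N = 15 * 60 * 1000L       // 15-minute visit timeout, in milliseconds
val base = 1348650000000L     // an arbitrary moment in September 2012

val pageviews = Seq(base, base + 5 * 60 * 1000L, base + 60 * 60 * 1000L, base + 62 * 60 * 1000L)

// the 55-minute gap between the second and third pageview splits them into two visits
assert(doingItIteratively(pageviews) ==
  Seq(Seq(base, base + 5 * 60 * 1000L),
      Seq(base + 60 * 60 * 1000L, base + 62 * 60 * 1000L)))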

It actually felt a bit odd to write procedural code like this and ignore the functional parts of Scala.  Using a fold cleans the code up a bit and gets rid of the mutable state.

def doingItByFolds(pageviews: Seq[Long]): Seq[Seq[Long]] = {
  val sortedPVs = pageviews.sorted

  (Seq[Seq[Long]]() /: sortedPVs) { (visits, pv) =>
    val isNewVisit = visits.lastOption flatMap (_.lastOption) map {
      prevPV => pv - prevPV > N
    } getOrElse true

    if (isNewVisit) {
      visits :+ Seq(pv)
    } else {
      visits.init :+ (visits.last :+ pv)
    }
  }
}

Here, we’re starting with an empty list of visits and folding it over the sorted pageviews.  At each pageview, we decide if we need to start a new visit.  If so, we append a new visit containing the pageview to the accumulated visits.  If not, we pop off the last visit, append the pageview, and put the last visit back on the tail of the accumulated visits.

One part that’s still a bit messy is comparing the current timestamp to the previous one.  We can improve that by iterating through the intervals between pageviews instead of the actual pageviews.

def slidingThroughIt(pageviews: Seq[Long]): Seq[Seq[Long]] = {
  val intervals = (0L +: pageviews.sorted).sliding(2)

  (Seq[Seq[Long]]() /: intervals) { (visits, interval) =>
    if (interval(1) - interval(0) > N) {
      visits :+ Seq(interval(1))
    } else {
      visits.init :+ (visits.last :+ interval(1))
    }
  }
}

Here, we’re prepending a “0L” timestamp (and assuming that none of the pageviews happened in the early 70s) and using the “sliding” method to pair each timestamp with the previous one.

So far, we’ve been using a sequence of pageviews as a visit.  What happens if we add an explicit Visit type?  This lets us convert all pageviews into Visits at the start, then focus on merging overlapping Visits.  One nice benefit is that this is a map-reduce algorithm that can be easily parallelized instead of one that must sequentially iterate over the pageviews (either explicitly or with a fold).

import scala.math.{min, max}

case class Visit(start: Long, end: Long, pageviews: Seq[Long]) {
  def +(other: Visit): Visit = {
    Visit(min(start, other.start), max(end, other.end),
          (pageviews ++ other.pageviews).sorted)
  }
}

def doingItMapReduceStyle(pageviews: Seq[Long]): Seq[Visit] = {
  pageviews.par map { pv =>
    Seq(Visit(pv, pv + N, Seq(pv)))
  } reduce { (visits1, visits2) =>
    val sortedVisits = (visits1 ++ visits2) sortBy (_.start)

    (Seq[Visit]() /: sortedVisits) { (visits, next) =>
      if (visits.lastOption map (_.end >= next.start) getOrElse false) {
        visits.init :+ (visits.last + next)
      } else {
        visits :+ next
      }
    }
  }
}

The map-reduce solution is fun, but in a production system, I’d probably stick with the sliding variation and add a bit more flexibility to track actual pageview objects instead of just timestamps.
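For reference, here's a rough sketch of what that more flexible version might look like: generic over the pageview type, with a caller-supplied timestamp function. It's based on the fold variation rather than the sliding one, since the sentinel-timestamp trick doesn't translate cleanly to arbitrary objects (names here are illustrative, not production code):

// sketch only: group arbitrary pageview objects into visits, given a function
// that extracts each pageview's timestamp in milliseconds
def groupIntoVisits[P](pageviews: Seq[P], maxGapMillis: Long)(timestamp: P => Long): Seq[Seq[P]] = {
  val sorted = pageviews.sortBy(timestamp)

  (Seq[Seq[P]]() /: sorted) { (visits, pv) =>
    val startsNewVisit = visits.lastOption flatMap (_.lastOption) match {
      case Some(prev) => timestamp(pv) - timestamp(prev) > maxGapMillis
      case None       => true
    }

    if (startsNewVisit) visits :+ Seq(pv)
    else visits.init :+ (visits.last :+ pv)
  }
}

// e.g. groupIntoVisits(events, 15 * 60 * 1000L)(_.timestampMillis)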

Wednesday, September 19, 2012

Using GROUP BYs or multiple INSERTs with complex data types in Hive.

In any sort of ad hoc data analysis, the first step is often to extract a specific subset of log lines from our files.  For example, when looking at a single partner’s web traffic, I often use an initial query to copy that partner’s data into a new table.  In addition to segregating out only the data relevant to my analysis, I use this to copy the data from S3 into HDFS, which will make later queries more efficient.  (Using maps as our log lines is how we support dynamic columns.)

create external table if not exists
original_logs(fields map<string,string>) location '...' ;

create table if not exists
extracted_logs(fields map<string,string>) ;

insert overwrite table extracted_logs
select * from original_logs where fields['partnerId'] = 123 ;

If I’m doing this for multiple partners, it’s tempting to use a multiple-insert so Hadoop only needs to make one pass of the original data.

create external table if not exists
original_logs(fields map<string,string>) location '...' ;

create table if not exists
extracted_logs(fields map<string,string>)
partitioned by (partnerId int);

from original_logs
insert overwrite table extracted_logs partition (partnerId = 123)
select * where fields['partnerId'] = 123
insert overwrite table extracted_logs partition (partnerId = 234)
select * where fields['partnerId'] = 234 ;

Unfortunately, in Hive 0.7.x, this query fails with the error message “Hash code on complex types not supported yet.”  A multiple-insert statement uses an implicit group by, and Hive 0.7.x does not support grouping by complex types.  This bug was partially addressed in 0.8, which added support for arrays and maps, but structs and unions are still not supported.

At first glance, adding this support looks straightforward.  This could be a good candidate for our next open source day.

Saturday, July 7, 2012

mdadm: device or resource busy

I just spent a few hours tracking down an issue with mdadm (the Linux utility used to manage software RAID devices) and figured I'd write a quick blog post to share the solution so others don't have to waste time on the same problem.

As a short background, we use mdadm to create RAID-0 striped devices for our Sugarcube analytics (OLAP) servers using Amazon EBS volumes.

The issue manifested itself as a random failure during device creation:


$ mdadm --create /dev/md0 --level=0 --chunk 256 --raid-devices=4 /dev/xvdh1 /dev/xvdh2 /dev/xvdh3 /dev/xvdh4
mdadm: Defaulting to version 1.2 metadata
mdadm: ADD_NEW_DISK for /dev/xvdh3 failed: Device or resource busy

I searched and searched the interwebs and tried every trick I found to no avail. We don't have dmraid installed on our Linux images (Ubuntu 12.04 LTS / Alestic cloud image) so there's no possible conflict there.  All devices were clean, as they are freshly created EBS volumes and I knew none of them were in use.  

Before running mdadm --create, mdstat was clean:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
unused devices: <none>

And yet after running it, the drives had been split across two different md devices instead of all going to /dev/md0:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : inactive xvdh4[3](S) xvdh3[2](S)
      1048573952 blocks super 1.2

md0 : inactive xvdh2[1](S) xvdh1[0](S)
      1048573952 blocks super 1.2

unused devices: <none>

Looking into dmesg didn't reveal anything interesting either:

$ dmesg 
...
[3963010.552493] md: bind<xvdh1>
[3963010.553011] md: bind<xvdh2>
[3963010.553040] md: could not open unknown-block(202,115).
[3963010.553052] md: md_import_device returned -16
[3963010.566543] md: bind<xvdh3>
[3963010.731009] md: bind<xvdh4>

And strangely, the creation or assembly would sometimes work and sometimes not:

$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0

$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: /dev/md0 has been started with 4 drives.

$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0

$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: cannot open device /dev/xvdh3: Device or resource busy

$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0

$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: cannot open device /dev/xvdh1: Device or resource busy
mdadm: /dev/xvdh1 has no superblock - assembly aborted

$ mdadm --manage /dev/md0 --stop
mdadm: stopped /dev/md0

$ sudo mdadm --assemble --force /dev/md0 /dev/xvdh[1234]
mdadm: /dev/md0 has been started with 4 drives.

I started suspecting I was facing some kind of underlying race condition where the devices would get assigned or locked during the device creation process. So I started googling for "mdadm create race" and finally found a post that tipped me off. While it didn't provide the solution, it put me on the right track by mentioning udev, and it took only a few more minutes to zero in on the fix: disabling udev events during device creation to avoid contention on the device handles.

So now our script goes something like:

$ udevadm control --stop-exec-queue
$ mdadm --create /dev/md0 --run --level=0 --raid-devices=4 ...
$ udevadm control --start-exec-queue

And we now have consistent, reliable device creation.

Hopefully this blog post will help other passers-by with a similar problem.  Good luck!



Tuesday, July 3, 2012

Amazon Web Services Outages: 4 Steps for Survival

(Cross-post from the Bizo Blog)

Another Amazon Web Services (AWS) cloud outage over the past weekend took down some pretty major services such as Netflix, Heroku, Pinterest and Instagram.  At Bizo, a company that provides business marketing services for hundreds of F1000 clients, we serve billions of requests a day across tens of thousands of websites, and have our entire infrastructure on the AWS cloud, but didn’t have any downtime.  The simple reason is that we take our customers’ uptime and site performance seriously, and have built tools and services on AWS to ensure high-availability (HA) and low-latency (LL) services.  Despite the FUD created by many of the industry blogs and press, it is possible to create HA and LL services on AWS if you follow some simple steps.


Thursday, June 14, 2012

AWS Billing Info in Hive

Amazon recently (finally!) launched programmatic access to your AWS billing data.

Once you turn it on, select a bucket, and grant access to the AWS system user, you'll get a .csv file with your estimated billing for the month. The files are delivered daily, but they contain month-to-date information and will replace the file from the previous day.

It's easy enough to view this information in Excel (or similar), but I thought it would be fun to take a look in Hive, especially once we start having data for a few months to aggregate over.

Amazon delivers the data to the root of your bucket. I decided to start moving it to a Hive-partitioned path, to make it easier to query once we start having more data. I wrote a simple Scala script to move the data to [bucket]/partitioned/year=[year]/month=[month]/[file]. Here's some example code.
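(The original gist isn't reproduced here; the sketch below is a rough stand-in rather than our exact script, with a placeholder bucket name and an assumed "-aws-billing-csv-YYYY-MM.csv" file-name pattern that you'd adjust for your own account.)

// sketch only: copy each monthly billing CSV from the bucket root into a
// Hive-partitioned prefix; bucket name and credentials are placeholders
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.s3.AmazonS3Client
import scala.collection.JavaConversions._

object PartitionBillingFiles {
  val bucket = "my-billing-bucket"  // hypothetical bucket name
  val BillingCsv = """.*-aws-billing-csv-(\d{4})-(\d{2})\.csv$""".r

  def main(args: Array[String]) {
    val s3 = new AmazonS3Client(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"))

    // note: listObjects only returns the first 1000 keys, which is plenty here
    for (obj <- s3.listObjects(bucket).getObjectSummaries) {
      obj.getKey match {
        case BillingCsv(year, month) =>
          val dest = "partitioned/year=%s/month=%s/%s".format(year, month, obj.getKey)
          s3.copyObject(bucket, obj.getKey, bucket, dest)
        case _ => // not a billing report; skip it
      }
    }
  }
}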

Ok, now we're ready to read the data in Hive.

Here's a Hive schema for the AWS billing information. It uses the csv-serde (make sure you add that jar before running the create table statement). Run alter table aws_billing recover partitions; to load in the partitions (one per year/month), and you're ready to query.

Like I said, it's overkill to use Hive to read this data for a month or so, but it's just so addictive having a SQL interface to arbitrary S3 data :).

Here are some example queries to get you started.

Costs by Service

select ProductCode, UsageType, Operation, sum(TotalCost)
  from aws_billing
 where RecordType in ("PayerLineItem", "LinkedLineItem")
 group
    by ProductCode, UsageType, Operation
;
  

EC2 usage, by size (across EC2/EMR)

select ProductCode, UsageType,sum(TotalCost)
  from aws_billing
 where RecordType in ("PayerLineItem", "LinkedLineItem")
   and UsageType like "BoxUsage%"
 group
    by ProductCode, UsageType
;
  

Wednesday, June 13, 2012

the golden rule of programming style

There's an interesting page on the subject of compilation units per file over at the Scala style guide.

The guideline is delightfully vague, which I will paraphrase as: mostly use single files, unless you can't, or unless it's better if you don't.

The author(s) go on to expand on the reasoning behind breaking the guideline:

Another case is when multiple classes logically form a single, cohesive group, sharing concepts to the point where maintenance is greatly served by containing them within a single file. These situations are harder to predict… Generally speaking, if it is easier to perform long-term maintenance and development on several units in a single file rather than spread across multiple, then such an organizational strategy should be preferred for these classes.

This touches on what I consider to be the golden rule of programming style: Make your intent clear and the code easy to read.

Software spends most of its life in maintenance, which is why we have style guides and coding standards. It's valuable to have consistent looking code to promote a shared vocabulary, improve readability, and steer away from confusing or error-prone constructs.

It is just as important to be able to understand, both as an author and as a reviewer, that in certain cases following the letter of the law goes against the main goal of improving readability and maintenance. A one-size-fits-all rule does not always work, and as the authors of this particular guideline mention, "these situations are harder to predict."

Make your intent clear and the code easy to read.

Friday, April 20, 2012

Scala Test Plug-in for Sublime Text 2

I have documented and put some polish on the Sublime Text 2 plug-in I blogged about previously. It lets you run a single Scala Test, or all tests in your project. It also lets you quickly navigate to any Scala file in your project folder, and switch back and forth between a class and its test. Check it out here: https://github.com/patgannon/sublimetext-scalatest

Dev Days: Hacking, Open Source and Docs

Dev Days

Every month we have a "Dev Day" where engineers take a break from their projects and work on "other stuff". Most start-up engineering teams have a "Hack Day" where everyone gets to hack on anything they want as long as they ship it and share it with the rest of the team. Of course we have Hack Days, but we also have other types of Dev Days. In fact, we have three:

  • Hack Days 
  • Open Source Days 
  • Doc Days

Open Source Days

You know what Hack Days are so I'll move on quickly to Open Source Days.  Just like most companies these days, Bizo uses a lot of open source software (OSS).  We love OSS and the community of developers and companies that share it.  Over the last few years, we've used plenty of OSS but we've also created and given back lots of code as well.

Actually, today is one of our Open Source Days, so all the engineers are working on both new and old open source projects. You can check out our (growing) list of projects by visiting code.bizo.com. Over the years, we've created a lot of tools around AWS, including s3cp, fakesdb, and aws-tools (a package of all the CLI tools). We've also built a lot of stuff for Hadoop (Hive, etc.), including csv-serde and gdata-storagehandler, and our latest is a Scala query language called revolute (still in development). In addition, we have a wide variety of other awesome code, including Joist, dependence.js, raphy-charts and other fun stuff!

Doc Days

The third type of Dev Day we have is called Doc Days. I know what you're thinking, but Doc Days are extremely valuable days for engineering, and everyone else for that matter. On Doc Day the entire engineering team works on wiki pages, code documentation, design docs, architecture docs and even blog posts. It really is better than it sounds!

If you've read my post on "building a kick ass engineering team", you know that one of the keys is the 3Cs...  Communication, Communication, Communication!  (My high school baseball coach taught me that one.)  As an engineering team, we believe that communication is one of the best things we can do for each other. As any company grows, communication becomes a larger and larger part of day-to-day work, and we see Doc Days as a way to ensure that we are communicating as clearly and accurately as we can.

Conclusion

These Dev Days have been a huge success for Bizo engineering. We've even inspired other departments to have similar days (Marketing in particular likes these documentation days!). We challenge you to go beyond the "Hack Day" and start thinking about other Dev Days that your engineering organization can benefit from.

Thursday, April 5, 2012

Implementation driven interfaces?

I've recently encountered some interesting pagination in the Google Groups admin interface. It starts off simple enough, nothing exciting here...

Instead of the usual 'Previous', we see 'First' on the next page.

Are they just being clever? Knowing that there's only one previous page? No such luck…

We've reached the end of the list. I hope you've found what you're looking for, otherwise start over from the beginning!

One has to wonder, who designed this interface? You can only go forward. If you overshoot, it's back to square one, then click, click, click… It's clear it does not have users in mind at all.

My guess is that it's based on some limitation in the backend storage or query mechanism. The system only allows forward navigation of query results, so the interface simply mirrors that…

What an incredibly frustrating experience! I'll never take simple pagination for granted again.

It's a good reminder to think about your users and how they will interact with the system. Mirroring the programming interface rarely works.

Wednesday, April 4, 2012

Capturing Client Side JS Errors on AWS

I saw a post go by on Hacker News this morning discussing capturing and reporting on client side errors. We have been doing this for a long time and I wanted to share our approach.

Background
Quick background: our customers and partners may use two major types of JavaScript from us, analytics tags and ad tags. Both tags are JavaScript and share the same error capture code.

Another quick note is that we run on Amazon Web Services, so this approach is built on several of those services, including S3, CloudFront and EMR.

Implementation
Our client side JS is compiled from CoffeeScript. I've created a couple of gists to show you what the error logging code looks like in CoffeeScript.




Details
The example shows our ad tags executing inside a try/catch that captures any error and eventually loads an image with the relevant error metadata appended to the request.

AWS Details
The image that is loaded actually lives on CloudFront. The CloudFront distribution is set up with logging, which means that requests are logged and delivered to a specified S3 bucket (usually within 24 hours). Every day we run an EMR job against the CloudFront request logs that generates a report summarizing the errors. And that's it. Pretty simple, and this approach has worked well for us.
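As a rough sketch of that aggregation (this is not our actual EMR job), assuming the error type comes back on the image request as a hypothetical err= query parameter, counting errors by type over a batch of CloudFront log lines looks something like this:

// sketch only: count error types in CloudFront access logs, assuming the beacon
// image is requested with a hypothetical "err=<type>" query parameter
import scala.io.Source

object ErrorReport {
  val ErrParam = """(?:^|[?&\s])err=([^&\s]+)""".r

  def main(args: Array[String]) {
    val lines = Source.fromFile(args(0)).getLines()

    val counts = lines
      .filterNot(_.startsWith("#"))  // skip the #Version / #Fields header lines
      .flatMap(line => ErrParam.findFirstMatchIn(line).map(_.group(1)))
      .toSeq
      .groupBy(identity)
      .mapValues(_.size)

    counts.toSeq.sortBy { case (_, count) => -count } foreach { case (err, count) =>
      println("%6d  %s".format(count, err))
    }
  }
}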

Pre-emptive "this isn't perfect" response
Some of you may be thinking, "you may not get all requests!" True: CloudFront logs are not meant to be used for 100% accurate reporting (although nothing is really 100%). In our case, we don't need to capture every error; rather, we are looking for directional information.

Monday, April 2, 2012

Creating Plug-ins for Sublime Text 2


I have been trying out Sublime Text 2 as my text editor lately, and I'm loving the simplicity, so I figured I would try out creating a plug-in for it. I was pleasantly surprised at how easy it is, which is an important step towards it becoming my new editor of choice. I wanted to take some steps towards creating something along the lines of rinari, but for Scala... in Sublime Text. I was able to fairly easily create a plug-in that allowed me to run the Scala Test that was currently open in the editor, run all Scala Tests in the (inferred) project folder, switch back and forth between a test and the code under test, or quickly navigate to any Scala file in the project folder with a few keystrokes. This post will show you how to create a new plug-in for Sublime Text 2, which uses all the API features that I needed to implement that functionality.

Create a new plug-in

Step 1. Install Sublime Text 2 (see link above). It's free to try, and fairly cheap to buy. A month or so after you download it, it basically becomes nag-ware until you finally manage to overcome your stingy developer impulses and plunk down the $59 to buy it. Also, unlike other similar text editors (ahem.. TextMate!) it actually runs on Windows and Linux, as well as Mac OSX.
Step 2. Create a new folder for your plug-in. On Mac OSX, this goes under your home folder in ~/Library/Application Support/Sublime Text 2/Packages/{PLUGIN_NAME} (where in my case, {PLUGIN_NAME} was "ScalaTest").
Step 3. Create a python file which will contain the code for the plug-in. (Name it whatever you want, as long as it ends in ".py" ;-) Here is a really basic plug-in (borrowed from this plug-in tutorial, which you should read after this):
import sublime, sublime_plugin

class ExampleCommand(sublime_plugin.TextCommand):
  def run(self, edit):
    self.view.insert(edit, 0, "Hello, World!")
Right, so, as I mentioned: Sublime Text 2 plug-ins are written in Python. Don't worry too much if you're not familiar with Python... I wasn't either prior to starting this experiment, and it didn't prove to be too much of a problem. (I did have a couple of Python books lying around, but I'm sure the same information is on the tubes.) It's fairly easy to pick up, and has some similarities to Ruby, in case that helps. So the code above creates a command called "example", which is defined by a class that inherits from Sublime Text's "TextCommand" class. (Sublime Text 2 maps the title-case class names to underscore-delimited command names, and strips the "Command" suffix.) All the plug-in does is insert the text "Hello, World!" at the beginning of the file open in the editor.
(Note: Sublime Text 2 will detect that you created a Python file under its plug-in folder and automatically loads it.)
Step 4. Run your example. Hit Ctrl+Backtick to open the python interpreter within Sublime Text 2. Run your command by typing in this:
view.run_command("example")
The open buffer will now include the aforementioned greeting. You could bind it to a key combination easily enough, but hey, it doesn't do anything cool yet, so we'll hold off on the key bindings until the end.

Make it do something cool

So that you can see these approaches in action, I uploaded my nascent ScalaTest plug-in to GitHub: https://github.com/patgannon/sublimetext-scalatest. Note that this plug-in will currently only work with projects that use Bizo's standard folder structure, and has a hard-coded path to the scala executable, so it's not ready to be used as-is. I hope to clean it up in the future and make it more generically applicable, but for now, I've only shared it to add a bit more color to the code snippets in this section.

Run a command on the current file

The name of the file currently open in the editor can be obtained with this expression: self.view.file_name(). In my plug-in, I use that to infer a class name, the project root folder, and the path to the associated test (using simple string operations).
You can create an output panel (in which to render the results of running a command on the open file) by calling: self.window().run_command("show_panel", {"panel": "output.tests"}) (where "output.tests" is specific to your plug-in). In my plug-in, I created the helper methods below to show the panel and clear out its contents. (See the BaseScalaTestCommand class in run_scala_test.py.) Note that this code was derived from code I found in the Sublime Text 2 Ruby Tests plug-in.
def show_tests_panel(self):
    if not hasattr(self, 'output_view'):
        self.output_view = self.window().get_output_panel("tests")
    self.clear_test_view()
    self.window().run_command("show_panel", {"panel": "output.tests"})

def clear_test_view(self):
    self.output_view.set_read_only(False)
    edit = self.output_view.begin_edit()
    self.output_view.erase(edit, sublime.Region(0, self.output_view.size()))
    self.output_view.end_edit(edit)
    self.output_view.set_read_only(True)
(Note: I don't recommend copy/pasting code directly from this blog post, because the examples are pasted in from github, which messes up the indentation, which is a real problem in Python; instead, clone the github repository and copy/paste from the real file on your machine.)
To actually execute the command, I use this code in my run method, after calling show_tests_panel defined above (note that you will need to import 'subprocess' and 'thread' at the top of your plug-in file; the helpers below also use 'os' and 'functools'):
self.proc = subprocess.Popen("{my command}", shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
thread.start_new_thread(self.read_stdout, ())


...where {my command} is the shell command I want to execute, and read_stdout is a method I defined which copies the output from the process and puts it into the output panel. It's defined as follows (and calls the append_data method, also defined below):
def read_stdout(self):
    while True:
        data = os.read(self.proc.stdout.fileno(), 2**15)
        if data != "":
            sublime.set_timeout(functools.partial(self.append_data, self.proc, data), 0)
        else:
            self.proc.stdout.close()
            break

def append_data(self, proc, data):
    self.output_view.set_read_only(False)
    edit = self.output_view.begin_edit()
    self.output_view.insert(edit, self.output_view.size(), data)
    self.output_view.end_edit(edit)
    self.output_view.set_read_only(True)
(Note: Depending on the command you're running, you may also want to capture the process' stderr output, and also put that into the output panel, using a variation of the approach above.)

Using the "quick panel" to search for files, and opening files

The "quick panel" (the drop-down which lists files when you hit command-T in sublime-text) can be extended to have plug-in specific functionality, which I used to create a hot-key for quickly navigating to any Scala file under my project folder. (See the JumpToScalaFile class in run_scala_test.py.) One of the plug-in examples I saw using the quick panel sub-classed sublime_plugin.WindowCommand instead of TextCommand. This results in a plug-in which can be run without any files being open. The flip side of that, though, is you don't get the file name of the currently open file, which in my case, is required to infer the base project folder for which to search for files. Thus, all my plug-ins sub-class TextCommand. To open the quick panel, execute: sublime.active_window().show_quick_panel(file_names, self.file_selected). file_names should be a collection of the (string) entries to show in the quick panel. Note that the entries don't have to be file paths, just a convenient identifier to show the user (in my case, the class name). file_selected is a method you will define which will be called when a user selects an entry in the quick panel. Here's how I defined it:
def file_selected(self, selected_index):
    if selected_index != -1:
        sublime.active_window().open_file(self.files[selected_index])


self.files is an array I created when populating the quick panel which maps an index in the quick panel to a file path. I then use sublime.active_window().open_file to open that file in Sublime Text.
I also used that same method (open_file) in the plug-in that automatically navigates back and forth between a test file and the code under test. That plug-in also makes use of the sublime.error_message method, which will display an error message to the user (if no test is found, for example).

Create keystroke bindings

To bind your new plug-in commands to keystrokes, create a file in your plug-in folder called Default (OSX).sublime-keymap. This will contain the keystrokes that will be used on Mac OSX. (You would create separate files for use on Windows and Linux.) It is a simple JSON file that maps keystrokes to commands. Let's see an example:
[
  { "keys": ["super+shift+e"], "command": "jump_to_scala_file" }
]
This example will bind Command+Shift+E to the "jump_to_scala_file" command (defined by the JumpToScalaFileCommand class). If you have multiple key-mappings, you would create multiple comma-delimited entries within the JSON array. (See the example in my plug-in.) In order to reduce the possibility of defining keystrokes that collide with keystrokes from other plug-ins, I defined mine in such a way that they're only available when the currently open file is a Scala file. Here is the rather verbose (ahem, powerful) syntax that I used to do that:
[
  { "keys": ["super+shift+e"], "command": "jump_to_scala_file",
    "context": [{"key": "selector", "operator": "equal", "operand": "source.scala", "match_all": true}] }
]

Conclusion

Over the years, I've grown to prefer light-weight editors (such as Emacs or Sublime Text 2) over more heavy-weight IDEs (such as Eclipse or Visual Studio) because they don't tend to lock up in the middle of writing code and/or crash sporadically, and I generally don't need a lot of whiz-bang features when I'm coding these days. I used Emacs (and rinari) for doing Rails development for a year or so, but the basic key-strokes (compared to the de-facto text editing standard key-strokes) and the undo/redo functionality always seemed a bit awkward, especially when you wind up switching back and forth between it and other text editors. Also, the language for writing extensions is Emacs Lisp (a dialect of Lisp), which to me isn't very convenient for these sorts of things.
I was really pleased with my foray into creating plug-ins for Sublime Text 2, and combined with its general ease of use, I've decided it's now my new favorite editor. Using an editor that's this easy to significantly customize seems like it could be a real productivity win over time. Given the fairly rich list of plug-ins already available, I think the future is bright for Sublime Text 2. Below is a list of resources I found helpful during this process, including said list of plug-ins.

Resources

Unofficial list of plug-ins:
http://wbond.net/sublime_packages/community

Tuesday, March 13, 2012

A Short Script for Logging into Interactive Elastic MapReduce Clusters

Elastic MapReduce is great, but the latencies can be painful.  For me, this is especially true when I'm in the early stages of developing a new job and need to make the transition from code on my local machine to code running in the cloud -- the ~5 minute period between starting up a cluster and actually being able to log on to it is too long to sit there staring at a blank screen and too short to effectively context switch to something else in a useful way.

My current solution is to allow myself to get distracted but to drag myself back to my EMR session as soon as it's available.  Adding some simple polling plus a sticky growl notification to my interactive-emr-startup script does the trick quite nicely:


#!/bin/bash

if [ -z "$1" ]; then
  echo "Please specify a job name"
  exit 1
fi

# scratch file to capture the elastic-mapreduce output
TMP_FILE=$(mktemp)

elastic-mapreduce \
  (... with all of my favorite options ...) \
| tee ${TMP_FILE}

JOB_ID=`cat ${TMP_FILE} | awk '{print $4}'`
rm ${TMP_FILE}

# poll for WAITING state
JOB_STATE=''
MASTER_HOSTNAME=''
while [ "${JOB_STATE}" != "WAITING" ]; do
  sleep 1
  echo -n .
  RESULT=`elastic-mapreduce --list | grep ${JOB_ID}`
  JOB_STATE=`echo $RESULT | awk '{print $2}'`
  MASTER_HOSTNAME=`echo $RESULT | awk '{print $3}'`
done
echo Connecting to ${MASTER_HOSTNAME}...

growlnotify -n "EMR Interactive" -s -m "SSHing into ${MASTER_HOSTNAME}"

ssh $MASTER_HOSTNAME -i ~/.ssh/emr-keypair -l hadoop -L 9100:localhost:9100


One of my personal productivity goals for the year is finding little places like this that I can optimize with a short script.  This particular one has rescued me from the clutches of HN more than once!

On Code Reviews and Developer Feedback

There's a great post from last week at 37signals, Give it five minutes:

While he was making his points on stage, I was taking an inventory of the things I didn’t agree with. And when presented with an opportunity to speak with him, I quickly pushed back at some of his ideas. I must have seemed like such an asshole.

His response changed my life. It was a simple thing. He said “Man, give it five minutes.” I asked him what he meant by that? He said, it’s fine to disagree, it’s fine to push back, it’s great to have strong opinions and beliefs, but give my ideas some time to set in before you’re sure you want to argue against them. “Five minutes” represented “think”, not react. He was totally right. I came into the discussion looking to prove something, not learn something.

There’s also a difference between asking questions and pushing back. Pushing back means you already think you know. Asking questions means you want to know. Ask more questions.

This is such a great outlook and a great way to approach the discussion of feedback for code reviews and design reviews.

It's surprising how little time development teams devote to training, or even internal discussion on effective feedback. As developers, we are constantly engaged in this kind of communication: white-boarding sessions, spec reviews, design reviews, code reviews. We're expected to give and receive feedback on a daily basis, but few of us are properly prepared for it. Not only do we lack the training, but we have many negative examples to draw from. Who hasn't been a part of a design review where tempers flare? Properly giving feedback is something that requires constant attention and practice. Receiving feedback can be just as difficult.

Culture of Communication

One of the major pillars of our engineering culture at Bizo is "the 3 Cs": Communication, Communication, Communication.

We've tried hard to build a team of engineers that are eager to receive feedback, humble about their abilities, objective and gracious with their feedback, and freely giving of their own knowledge and experience. We see communication as a prerequisite for building a world-class team and developing high-quality code. You often hear the phrase "strong opinions, weakly held," and that is the kind of culture we have tried to build.

Communication is hard. It takes real team agreement and a commitment to continued work to keep this culture alive and well. It's important that the team views effective communication as a priority and that the culture supports it.

Code Reviews

Code reviews are something that can easily be approached from the wrong perspective, both as an author or reviewer.

As a reviewer, it can be easy to jump in and argue, to try and push 'your' solution (even though it may be equivalent), to push back instead of asking questions and trying to understand.

As an author, it's far too easy to get attached to your code, to your specific solution/naming/etc. It's also easy to feel like each comment is an attack on your ability, or that accepting the feedback somehow means that you were wrong or did a bad job. Of course, nothing could be further from the truth!

At Bizo, we perform code reviews for every change. They are a major part of our culture of communication. In order to perform effective code reviews, it's important to have some shared guidelines that help support effective communication.

Here are some guidelines we've found to be helpful for performing code reviews:

What is a code review

  • a careful, line-by-line critique of code by peers
  • that happens in a non-threatening context
  • whose goal is cooperation and mutual learning, not fault finding

Code reviews are a team exercise to improve understanding and make the code better!

When people think of code reviews they usually think of catching bugs. Code reviews do occasionally catch bugs or potential performance problems, but this is rare.

Just as important is fostering a shared understanding of the code and exposure to new approaches, techniques, and patterns. Seeing how your peers program is a great way to learn from them.

Enforcing coding standards and style guides is another way code reviews help. When working on a team, it's important to keep readability and quality high by using a shared vocabulary.

As an author

As an author, it's important to view each comment as a new opportunity to improve your code. Instead of jumping into defense mode, take a step back and think. Try to approach the code again for the first time with this new perspective. Your team has a lot of experience and varied backgrounds -- draw from them! They are there to help you. Use the gift of their experience and knowledge to improve the code.

Trust the team, and view all comments as action items. Some changes can seem arbitrary, especially when it comes to naming and organization. Unless there's a strong reason, tend to agree with your reviewers. If a reviewer finds something confusing, it is confusing! Code spends most of its life in maintenance and programming is a team sport. Remember that they are your audience, and you want them to be able to understand your code at 4am after a system crash.

As a reviewer

As a reviewer, it's important to take the time to read the code, think, and ask questions to understand it before providing feedback. The author has probably spent a lot more time thinking about the problem and the approach over the course of the project.

Be strict on coding standard and style guide violations. The real cost of software is maintenance (80% according to Sun). It's important the code is easily understood by the team.

Be gentle on personal preferences. If it's not a standard violation and just a matter of personal preference, defer to the author. It's okay to present your perspective, but mention that it's just a preference and not meant to be taken as an action item.

Trust the author. It's often the case that there are many valid approaches to a problem. It's great to present alternative approaches and discuss pros/cons of various approaches. If you see alternative solutions, bring them up! When discussing alternatives, make sure to listen to the author. Remember they are the subject matter expert and you are working together on the same team.

It takes work!

Communication is hard! It's easy to screw up. It's easy to go into attack or defense mode when you're passionate about what you're doing. It's really something we all need to remind ourselves to work on every day, and something we need to periodically revisit as a team. Try to view each review as an opportunity to practice these guidelines. Just remember to take a step back, think, and ask questions.

Monday, March 12, 2012

Fault Tolerant MongoDB on EC2

While working on a project at Bizo I needed to connect a Rails app to a MongoDB backend, both of which run in Amazon's cloud (EC2). At Bizo we have a policy of avoiding non-Amazon services when possible (to limit risk), so we normally run most of our services straight off of EC2. I'd like to share the best practices I learned along the way, in the hope of saving others some time and frustration.

Primer

Replica sets are the preferred way to run a distributed, fault tolerant MongoDB service. But as with any distributed system, nodes will eventually fail. Now replica sets are pretty good at handling failures, but they can't save you if too many nodes fail.
Specifically, a replica set needs a majority of its nodes up to keep a primary elected (for a 3-node set, that means at least 2 nodes: 1 primary and 1 secondary). Thus a good rule of thumb is to run at least 3 nodes in a replica set; that way, if a single node fails, your database service doesn't go down with it. The Rails app I was working with doesn't experience enormous amounts of traffic, so 3 m1.large (64-bit) nodes were sufficient for my needs. What follows is a rundown of our setup and how it handles the common needs of fault tolerant systems.

Best Practices


Minimize Failure with AutoScaling, Availability Zones, CloudWatch and EBS Volumes

  • Use Autoscaling Groups, CloudWatch and EBS Volumes to replace failed nodes as soon as they go down. Since we run three nodes, our replica set is insulated from failure due to a single node crashing. But if two nodes crash, the replica set goes with them. To solve this we use CloudWatch alarms to trigger the Autoscaling Group whenever a node goes down, so that a new replacement node is automatically brought online within a few minutes of a failure, reducing the risk of nodes sequentially failing. Additionally, each node stores its data on an EBS Volume (a network-attachable hard drive); that way when a node fails, its replacement doesn't start up with missing data - it simply mounts the previous node's EBS volume.
  • To protect against multiple nodes failing simultaneously, run each node in a separate availability zone. The above isn't sufficient to protect against things like hardware failures, as all 3 instances could wind up on the same hardware. Running each node in a separate availability zone guarantees that our mongo instances run with a reasonable amount of separation (e.g. they don't all end up on the same hardware box). Ideally you'd run each node in its own region (a separate data center), but this causes headaches trying to configure firewalls, as Amazon does not allow security groups to be used across multiple regions (see security below). So unless you want to set up a VPN for cross-region communication, you're probably better off just running in separate availability zones.
    Assuming you've created the group and are running three nodes, each in a separate availability zone, you can configure the auto scaling group using Amazon's command line tools like so:
    
    as-update-auto-scaling-group my-mongo-service \
    --region us-east-1 \
    --availability-zones us-east-1a us-east-1b us-east-1c \
    --max-size 3 \
    --min-size 3 \
    --desired-capacity 3
    
    
    Now if any node fails then a new one will startup to take its place in the proper availability zone.

If Everything Fails, have backups

It's always good to have backups just in case something really bad happens. Fortunately, since we use EBS Volumes, this is really easy: we create nightly snapshots of the primary node's EBS Volume from our cron server using Amazon's command line tools (ec2-create-snapshot). These snapshots are persisted to S3, and we can easily restore our replica set from them.
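For reference, the same call through the AWS SDK for Java (a rough sketch from Scala; the volume id and credentials are placeholders) is a one-liner:

// sketch: snapshot the primary node's data volume (volume id and credentials are placeholders)
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.ec2.AmazonEC2Client
import com.amazonaws.services.ec2.model.CreateSnapshotRequest

val ec2 = new AmazonEC2Client(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"))
val result = ec2.createSnapshot(new CreateSnapshotRequest("vol-12345678", "nightly mongodb backup"))
println("started snapshot " + result.getSnapshot.getSnapshotId)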

Use Elastic IPs

As nodes fail and are replaced, you want both your replica set and your database clients to be able to find and connect to the new nodes. The easiest way to do this in Amazon is to use Elastic IPs: special static IP addresses that can be assigned to individual instances. Since each instance runs in a separate availability zone, we just need one Elastic IP per zone. When a new node starts up to replace a failed instance, it checks which zone it was started in and assigns itself the matching Elastic IP. Both the client and replica set configuration should point at the Elastic IPs (or rather their DNS names; see the security section below), so failures and startups of new nodes are seamless to your app. The reason for using the DNS names is that cross-security-group openings in the firewall need to use internal (not external) addresses, and the DNS name shown in the console resolves to an internal IP address from an EC2 instance.
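Here's a sketch of that "claim my zone's Elastic IP at boot" step (the zone-to-address map and credentials are placeholders; the metadata URLs are the standard EC2 instance metadata service):

// sketch: at startup, look up which zone we're in and grab that zone's Elastic IP
import scala.io.Source
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.services.ec2.AmazonEC2Client
import com.amazonaws.services.ec2.model.AssociateAddressRequest

object ClaimElasticIp {
  // one Elastic IP per availability zone (placeholder addresses)
  val eipForZone = Map(
    "us-east-1a" -> "203.0.113.10",
    "us-east-1b" -> "203.0.113.11",
    "us-east-1c" -> "203.0.113.12")

  def metadata(path: String) =
    Source.fromURL("http://169.254.169.254/latest/meta-data/" + path).mkString.trim

  def main(args: Array[String]) {
    val instanceId = metadata("instance-id")
    val zone = metadata("placement/availability-zone")
    val ec2 = new AmazonEC2Client(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"))
    ec2.associateAddress(new AssociateAddressRequest(instanceId, eipForZone(zone)))
  }
}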

Security

This is where the headaches can start. Ideally you want to restrict access to your MongoDB instances to just your client application using Amazon's security groups. The way we normally set this up is to give your mongo instances a security group, say mongo-db-prod, and your client app a security group, say cool-app-prod. Then mongo-db-prod grants access on port 27017 (the default mongodb port) to the security group cool-app-prod. Unfortunately, what's not documented very well is that if you use the external Elastic IP addresses in your configuration, it will not work with security groups! Instead you have to use the Elastic IP's DNS name (found in Amazon's web console) for security groups to work properly.

A Final Caveat

One thing to be careful of: if you require more than 5 nodes in a replica set you'll run into a problem using Elastic IPs, since Amazon by default only allows 5 Elastic IPs per region. You'll need to either ask Amazon to increase this limit on your account or seek out an alternative setup.
Well, that's it, but if you have another setup for running MongoDB on EC2 I'd love to hear it. Until next time.

Friday, February 17, 2012

Building a Product in Just 8 Hours

Recently at Bizo, we decided to try a new kind of hack day. Previously during hack days our engineers worked individually on their own project(s). But on our last hack day we decided to try something new – The 8 Hour Product Challenge.
We would build and launch a completely new product in the course of a normal workday (9am-5pm). “Launching” meant this product had to be running publicly on the internet by 5pm – no excuses, no “Wait! I need 5 more minutes” – whatever was there had to be deployed. In short, the experience was fantastic and I can’t wait to do it again. Here’s a breakdown of the experience:

Initial Meeting 9:30am

Organizing developers for a meeting of any kind is like trying to herd cats. But if it’s a meeting before 11:00am you’re not herding regular cats, you’re herding sleepy, fat cats with one leg and half an ear. Somehow, after a lot of cattle prodding by our VP of engineering, our team eventually managed to shuffle its way into the conference room like the decaffeinated zombies we were and get to work.
We decided to build a stealth product. The system would use Bizo’s rich business data to personalize special content for visitors based on things like their industry, company size and seniority. The goal for the end of the day was to have a small webapp up and running on Amazon’s servers.
After a bit more discussion, we decided to split tasks up into five groups of two engineers:
  • Data Discovery Team – Find relevant items for users by using our B2B business data & network.
  • Scraping Team – given a URL representing an item, scrape the page contents and store them for later use
  • Data Classification Team – extract relevant data from the HTML source of the previously scraped URLs
  • Backend Team – backend architecture for the webapp that fetches and serves the content generated in the previous steps
  • Frontend Team – frontend design + javascript that makes the app functional
Engineers were assigned more or less randomly, with the exception of myself – I was assigned to the frontend team directly. During the course of our initial planning meeting, we (myself included) often found ourselves becoming sidetracked with feature bloat, premature scalability concerns and a myriad of other things not essential to our MVP. Fortunately one of my coworkers (Stephen) was smart enough to enforce timeboxing the meeting to one hour – eventually we got things back on topic and designed the critical components before time ran out.

Start Work 10:30am

Once the teams were assigned, we all jumped in and started to work on our relevant tasks. My partner, Darren, and I immediately started out by sketching ideas on paper for our design. I can’t stress enough how important sketching is for being able to rapidly prototype a product – a trick I picked up back when I interned over at ZURB. Only after we had some solid sketches did we move into Photoshop mockups. Meanwhile the other teams were all furiously programming their parts of the application:
  • Data Discovery Team – worked out a simple ranking algorithm for the data and started writing the Hive script to extract it
  • URL Scraping Team – started out in Scala hacking up a script to scrape & download URL content
  • Data Extraction Team – decided to try out the Pismo gem to extract summaries and titles from scraped HTML content
  • Backend Team – was working on getting a sweet Scalatra webapp up and running

Lunch Time & Status Updates 12:30pm

By lunchtime everything seemed to be coming along nicely. On the frontend, we had completed our Photoshop mockup and had just begun writing some basic CSS styles. All the other teams reported making good progress on their tasks, with no major snags in the foreseeable future (betcha you never heard that one before…).

Afternoon 1:30pm

My frontend partner and I powered through our post-lunch food coma and began wiring up the UI using CoffeeScript in conjunction with Dependence.js. My teammate and I decided to give pair programming a shot. He has always been more of an Emacs kind of guy, while I prefer vi, but in the interests of learning I decided to try Emacs for the rest of the day – I now know why he’s always so worried about contracting carpal tunnel syndrome :).
Using some sample data generated by the first three teams, we were able to get a rough UI working pretty quickly. Our side of things turned out to be pretty straightforward and involved three AJAX requests: one to retrieve a list of items grouped by segment from the Scalatra web server, one to get a list of the current targetable segments from the Bizo API, and another call to the API to retrieve a visitor’s business segments (bizographics).
The only snag we hit was a race condition – originally we attempted to execute all three requests simultaneously, when in reality we had to wait for the list of segments before getting the visitor’s profile. Darren and I just looked at each other and shrugged, then we indented the third API request a few spaces in our CoffeeScript code – race condition solved! Yes, that’s correct: we fixed a race condition by indenting some code. Don’t judge – it was a hack day.
The other teams all seemed to be doing well. The scraping team discovered Scala collections’ magical par(), which turns normal data structures into parallel ones – they almost peed their pants with joy. At this point the backend team had completed the Scalatra app and was working on setting up our eventual deployment to EC2 using our custom infrastructure, cowboy.

The Home Stretch & Deployment 4:30pm

Right around 4:30 we ran into a major problem. There had been a miscommunication regarding the necessary format of the JSON file needed by the frontend, and our data was coming through to the app in a format that just wouldn’t work. We had to scramble and hash things out with the other teams quickly before we hit our 5pm deadline. Thanks to a major push by the Data Extraction Team we were finally able to get everything in place. The product worked! – it wasn’t the most polished app ever, but what we had accomplished in just one day was pretty amazing. The product was deployed on EC2 and presented internally within Bizo – it was met with a lot of excitement and Bizo will probably be releasing it publicly in the weeks to come.

Closing Thoughts

Overall the experiment turned out to be a smash hit. Looking back on the experience, there are a bunch of things we could have done better. In retrospect we got sidetracked a bit too much on non-essential feature ideas when we really should have been spending time clarifying the format of the data as it passed through each team’s project – this was something that came back and bit us near the end of the day. But mishaps aside, it’s really amazing what you can accomplish with 9 other talented people in a single day.
I can’t recommend trying this with your team enough – what do you think, is your engineering team up for The 8 Hour Product Challenge? If you’re interested in the technologies we used throughout the day, see below.

Appendix, Technologies Used

  • News Discovery Team – Hive, Amazon EC2, Amazon S3
  • URL Scraping Team – Scala
  • Data Extraction Team – jRuby, gems of note: pismo, right_aws (for Amazon S3)
  • Backend Team – Scalatra, Amazon S3
  • Frontend Team – Photoshop, HTML5, Sass, CoffeeScript, jQuery, dependence.js

Monday, January 30, 2012

work at Bizo (looking for some good engineers)

We’re a small, disciplined team that gets a lot done. Our platform processes billions of page views monthly and 100s of terabytes of data so we have lots of fun problems to tackle. We believe in teamwork and communication: comments, design reviews, code reviews for every change, weekly tech talks. We believe in giving developers ownership over projects. We believe Engineering is more than coding. We have fun and keep the beer fridge well stocked.

We have customers, are well funded, and were recently named the fourth fastest growing private company in the San Francisco Bay Area.

We are looking for motivated problem solvers with an entrepreneurial / hacker spirit.

If you're a reader of this blog, you already know our technology stack. Some highlights: Scala, Java, JavaScript, Ruby, AWS (pretty much every service), Hadoop/Hive, GWT, MongoDB, Solr, etc.

If you're interested, please apply on stackoverflow.

Wednesday, January 18, 2012

Using GenericUDFs to return multiple values in Apache Hive

A basic user defined function (UDF) in Hive is very easy to write: you simply subclass org.apache.hadoop.hive.ql.exec.UDF and implement an evaluate method.  We've previously written about this strategy, and it works well for most simple cases.

The first case where this breaks down is when you want to return multiple values from your UDF.  For me, this often arises when we have serialized data stored in a single Hive field and want to extract multiple pieces of information from it.

For example, suppose we have a simple Person object (leaving out all of the error checking code):

case class Person(firstName: String, lastName: String)

object Person {
  def serialize(p: Person): String = {
    p.firstName + "|" + p.lastName
  }

  def deserialize(s: String): Person = {
    val parts = s.split("\\|")  // split takes a regex, so the pipe needs escaping
    Person(parts(0), parts(1))
  }
}

We want to convert a data table containing these serialized objects into one containing firstName and lastName columns.

create table input(serializedPerson string) ;
load data local inpath ... ;

create table output(firstName string, lastName string) ;


So, what should our UDF and query look like?


Using the previous strategy, we could create two separate UDFs:


insert overwrite table output
select firstName(serializedPerson), lastName(serializedPerson)
from input ;


Unfortunately, the two invocations will have to separately deserialize their inputs, which could be expensive in less trivial examples.  It also requires writing two separate implementation classes whose only difference is which field to pull out of your model object.
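(For concreteness, those two simple UDFs would look something like the sketch below; the class names are illustrative. Note how each one deserializes the whole Person just to return a single field.)

import org.apache.hadoop.hive.ql.exec.UDF

// sketch of the two-UDF approach: each function re-deserializes the Person on its own
class FirstName extends UDF {
  def evaluate(serialized: String): String = Person.deserialize(serialized).firstName
}

class LastName extends UDF {
  def evaluate(serialized: String): String = Person.deserialize(serialized).lastName
}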


An alternative is to use a GenericUDF and return a struct instead of a simple string.  This requires using object inspectors to specify the input and output types, just like in a UDTF:


import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector._
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory.STRING
import scala.collection.JavaConversions._  // lets us pass Scala Seqs where Hive expects java.util.Lists

class DeserializePerson extends GenericUDF {
  private var inputInspector: PrimitiveObjectInspector = _

  def initialize(inputs: Array[ObjectInspector]): StructObjectInspector = {
    this.inputInspector = inputs(0).asInstanceOf[PrimitiveObjectInspector]

    val stringOI = PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(STRING)

    val outputFieldNames = Seq("firstName", "lastName")
    val outputInspectors = Seq(stringOI, stringOI)
    ObjectInspectorFactory.getStandardStructObjectInspector(outputFieldNames,
      outputInspectors)
  }

  def getDisplayString(children: Array[String]): String = {
    "deserialize(" + children.mkString(",") + ")"
  }

  def evaluate(args: Array[DeferredObject]): Object = {
    val input = inputInspector.getPrimitiveJavaObject(args(0).get)
    val person = Person.deserialize(input.asInstanceOf[String])
    Array(person.firstName, person.lastName)
  }
}


Here, we're specifying that we expect a single primitive object inspector as an input (error handling code omitted) and returning a struct containing two fields, both of which are strings.  We can now use the following query:


create temporary function deserializePerson as 'com.bizo.udf.DeserializePerson' ;


insert overwrite table output
select person.firstName, person.lastName
from (
  select deserializePerson(serializedPerson) as person
  from input
) parsed ;


This query deserializes the person only once but gives you access to both of the values returned by the UDF.


Note that this method does not allow you to return multiple rows -- for that, you still need to use a UDTF.
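For completeness, the UDTF shape looks like the sketch below: a hypothetical function that explodes a comma-separated list of serialized people into one row per person. This is illustrative only, not a drop-in replacement for the UDF above.

import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector._
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory.STRING
import scala.collection.JavaConversions._

// hypothetical sketch: turn a comma-separated list of serialized Person records
// into one (firstName, lastName) row per person
class ExplodePeople extends GenericUDTF {
  private var inputInspector: PrimitiveObjectInspector = _

  def initialize(inputs: Array[ObjectInspector]): StructObjectInspector = {
    inputInspector = inputs(0).asInstanceOf[PrimitiveObjectInspector]
    val stringOI = PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(STRING)
    ObjectInspectorFactory.getStandardStructObjectInspector(
      Seq("firstName", "lastName"), Seq[ObjectInspector](stringOI, stringOI))
  }

  def process(args: Array[AnyRef]) {
    val serialized = inputInspector.getPrimitiveJavaObject(args(0)).asInstanceOf[String]
    serialized.split(",") foreach { s =>
      val person = Person.deserialize(s)
      forward(Array(person.firstName, person.lastName))  // emit one output row per person
    }
  }

  def close() {}
}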