Thursday, April 14, 2011

Hive Unit Testing


Hive has become an extremely important component in our overall software stack. We have numerous ‘mission-critical’ reports that are generated using Hive and want to make sure we can apply our testing processes to Hive scripts in the same way that we apply them to other code artifacts.

A few weeks ago, I was tasked with finding an approach for unit testing our Hive scripts. To my surprise, a Google search for ‘Hive Unit Testing’ yielded relatively few useful results.

I wanted a solution that would allow us to test locally (as opposed to one that would require EMR). Where possible, I prefer local testing because it’s simpler, provides more immediate feedback, and doesn’t require network access.

After reading this post, you will (hopefully) know how to run Hive unit tests in your own environment.

The Approach

After performing some research, I decided on an approach that is part of the Hive project itself.  At a high level, the solution works in the following way:
  • Start up an instance of the Hive CLI
  • Execute a Hive script (positive or negative case)
  • Compare the CLI output of the script against an expected output file
  • Rinse and repeat
The rest of this post discusses the specific steps required to get this solution running in your own environment.

Set up Hive Locally

The first step is to create some Ant tasks for setting up Hive locally. Here’s a snippet of Ant that shows how to do this:
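Something along these lines works, using Ant’s built-in get and untar tasks. The property names (hive.version, tools.dir) and the download URL are illustrative placeholders; only the hive.init target name matters:

```xml
<!-- Illustrative sketch: property names and the download URL are placeholders. -->
<property name="hive.version" value="0.6.0"/>
<property name="tools.dir" value="${basedir}/tools"/>
<property name="hive.tarball" value="hive-${hive.version}-bin.tar.gz"/>

<target name="hive.init" description="Download and unpack Hive into the tools directory">
  <mkdir dir="${tools.dir}"/>
  <!-- usetimestamp avoids re-downloading an unchanged tarball -->
  <get src="http://archive.apache.org/dist/hive/hive-${hive.version}/${hive.tarball}"
       dest="${tools.dir}/${hive.tarball}"
       usetimestamp="true"/>
  <untar src="${tools.dir}/${hive.tarball}" dest="${tools.dir}" compression="gzip"/>
</target>
```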

You should now be able to execute ‘ant hive.init’ and have Hive available in the tools directory.

Generate test cases

The developer is responsible for providing the .q Hive files that represent the test cases. There is a code generation step that will create JUnit classes (one for positive test cases, one for negative test cases) given a set of .q files. The Ant snippet below shows how to generate the test classes:
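The generation step is driven by Hive’s own QTestGenTask Ant task. A sketch is shown here; the hive.classpath reference and the test.build.src property are assumptions for your own build, and the exact task attributes are based on the Hive 0.6 build scripts, so verify them against your Hive version:

```xml
<!-- Sketch only: QTestGenTask ships with Hive; hive.classpath and
     test.build.src are placeholders for your own build. -->
<taskdef name="qtestgen"
         classname="org.apache.hadoop.hive.ant.QTestGenTask"
         classpathref="hive.classpath"/>

<target name="hive.gen.test">
  <!-- Generate the positive-case JUnit class from the positive .q files -->
  <qtestgen outputDirectory="${test.build.src}"
            templatePath="${hive.test.template.dir}"
            template="TestCliDriver.vm"
            queryDirectory="${target.hive.positive.query.dir}"
            queryFile="${qfile}"
            resultsDirectory="${hive.positive.results.dir}"
            className="TestCliDriver"/>
  <!-- Generate the negative-case JUnit class from the negative .q files -->
  <qtestgen outputDirectory="${test.build.src}"
            templatePath="${hive.test.template.dir}"
            template="TestNegativeCliDriver.vm"
            queryDirectory="${target.hive.negative.query.dir}"
            resultsDirectory="${hive.negative.results.dir}"
            className="TestNegativeCliDriver"/>
</target>
```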

Here are some notes about the key variables above:
  • hive.test.template.dir - the directory where the Velocity templates are located for the code generation step.
  • target.hive.positive.query.dir - the directory where positive test cases are located.
  • target.hive.negative.query.dir - the directory where negative test cases are located.
  • hive.positive.results.dir - the directory where expected positive test results are located. Each results file must be named after its query file with ‘.out’ appended. For example, if the test query file is named hive_test.q, then the results file must be named hive_test.q.out.
  • hive.negative.results.dir - the directory where expected negative test results are located.
  • qfile - This variable should be specified if you want to generate a test class with a single test case. For example, if you have a test file named hive_test.q, then you would set the value of this property to hive_test (e.g. ant -Dqfile=hive_test hive.gen.test).
  • qfile_regex - Similar in functionality to qfile, this variable should be set to a regular expression that will match the test files that you want to generate tests for.
The test classes are generated from Velocity template files. You can find example templates in the Hive codebase here:

The above files can be used largely as-is, but you will need to provide your own test helper class, QTestUtil, and update its package location in the templates accordingly.


QTestUtil contains code for:
  • starting up hive
  • executing a query file
  • comparing the results to expected results
  • running cleanup between tests
  • shutting down hive
You can find the one from the Hive project here:

The main modifications you will want to make to this file are deletions, since it contains some Hive-project-specific setup code that you will not need in your environment.

Executing the tests

After you have generated the tests, you can execute them by creating a target with the junit task. Here is some sample Ant for doing this:
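A sketch using Ant’s junit and batchtest tasks follows; the classpath reference and directory properties are placeholders for your own build, and it assumes the generated classes follow Hive’s TestCliDriver / TestNegativeCliDriver naming:

```xml
<!-- Sketch only: test.classpath, test.build.classes, and test.build.reports
     are placeholders for your own build properties. -->
<target name="hive.test" depends="hive.gen.test">
  <mkdir dir="${test.build.reports}"/>
  <!-- fork=yes runs the tests in a separate JVM, which Hive generally needs -->
  <junit printsummary="yes" haltonfailure="no" fork="yes">
    <classpath refid="test.classpath"/>
    <formatter type="xml"/>
    <batchtest todir="${test.build.reports}">
      <fileset dir="${test.build.classes}"
               includes="**/TestCliDriver.class,**/TestNegativeCliDriver.class"/>
    </batchtest>
  </junit>
</target>
```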


This post outlined a solution for unit testing Hive scripts. Another nice aspect of this approach is that it’s based on JUnit, so you can use your existing code-coverage tools with it (we use Cobertura) to get coverage information when testing custom UDFs. Also, I should mention that I used Hive 0.6.0 when putting this together.
