Monday, August 26, 2013

SCM Migration

We happily used Atlassian’s hosted OnDemand service for source code management with the following setup:
  • Subversion: source control management
  • FishEye: source code browsing
  • Crucible: code reviews
  • Jenkins (hosted on EC2): continuous integration and periodic jobs (http://dev.bizo.com/2009/11/using-hudson-to-manage-crons.html)

However, Atlassian is ending their OnDemand offering for source code management in October, so it was time for a change. The good news: we wanted to migrate to git anyway. The bad news: we had hundreds of projects in our subversion repository and needed to break them up into separate git repositories.

We switched on a Thursday morning with minimal developer interruptions, and now we're on a new setup:
  • Bitbucket: source control management and code browsing
  • Crucible (hosted on EC2): code reviews
  • Jenkins (hosted on EC2): continuous integration and periodic jobs


How'd we do it? Read on, my friend.

Problem
Move hundreds of projects (some with differing branching structures) to an equivalent number of git repositories. And change hundreds of Jenkins job configurations from pulling code out of subversion to pulling code from git. And set up a new Crucible instance for code reviews for the hundreds of repos. All without disrupting the dev team's work. For subversion, this meant moving the code, including branches and commit history, from subversion into Bitbucket. For Jenkins, it meant changing the job configs to point at the equivalent git repository with the same code and branch as the old subversion configuration. This blog post focuses on the subversion to git migration. Fixing the Jenkins configs will be covered in a later blog post.

Subversion to Git
Converting a single repository from subversion to git is fortunately straightforward thanks to the terrific tool git-svn (https://www.kernel.org/pub/software/scm/git/docs/git-svn.html).
The challenging part was determining how each project configured branches. In subversion, branches are just another subdirectory in the repository, so any level of the directory hierarchy can hold them. Git, however, only supports branches at the root of the repository. Git-svn allows you to tell git what directory the branches are in, but first you have to find what directory that is.

Our subversion repositories followed two primary branching structures: branch at the module level or branch at the project level.

I will call the first layout "module level". Module level projects had a separate branch point for each module in the project. These projects were usually several loosely connected modules that could be deployed separately, or libraries that were related but could be imported independently. Module level projects looked like this:
- svn/<project>/trunk/<module1>
- svn/<project>/trunk/<module2>
- svn/<project>/branches/<module1>/<branch1_for_module1>
- svn/<project>/branches/<module1>/<branch2_for_module1>
- svn/<project>/branches/<module2>/<branch1_for_module2>

"module level" projects mapped into a separate git repo for each module using this git-svn command:
git svn clone <svn_root> --trunk <project>/trunk/<module> --branches <project>/branches/<module> --tags <project>/branches/<tags> <module>

The other branching structure I’ll call "project level". These projects also had multiple modules, but the branches were defined such that each branch contained the entire project. These projects were usually separate modules for the domain layer, application layer and web layer, or closely related applications that used the same database. Parts could perhaps be deployed separately, but they often needed to be deployed at the same time, such as when the database schema changed. Project level projects looked like this:
- svn/<project>/trunk/<module1>
- svn/<project>/trunk/<module2>
- svn/<project>/branches/<branch1>/<module1>
- svn/<project>/branches/<branch1>/<module2>
- svn/<project>/branches/<branch2>/<module1>
- svn/<project>/branches/<branch2>/<module2>

"project level" projects mapped into a single git repo containing all modules using this git-svn command:
git svn clone <svn_root> --trunk <project>/trunk --branches <project>/branches --tags <project>/branches <project>

To automate the git-svn clones, I wrote a ruby script that used "svn ls" to find the list of all projects. Each project was assumed to be "module level" unless it was in a hard-coded list of known "project level" projects. It was important for this to be fully automated as the list of "project level" projects was not complete until near the end of the migration. It took several tries to make sure the migration was correct. Some projects unfortunately used both branching structures, which is not supported by git-svn. Some of these branches were abandoned anyway, but others were moved using "svn mv" to fit that project's standard branch structure.
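The selection logic described above can be sketched roughly as follows. This is not the original ruby script; it is a minimal Python illustration under stated assumptions. The names PROJECT_LEVEL, parse_svn_ls and clone_commands are hypothetical, the project name "webapp" is made up, and the --tags option is left out for brevity.

```python
from typing import List, Set

# Hypothetical hard-coded list of projects known to branch at the project level.
PROJECT_LEVEL: Set[str] = {"webapp"}


def parse_svn_ls(output: str) -> List[str]:
    """Return the directory names from `svn ls` output (directories end in '/')."""
    return [line.strip().rstrip("/")
            for line in output.splitlines()
            if line.strip().endswith("/")]


def clone_commands(svn_root: str, project: str, modules: List[str]) -> List[List[str]]:
    """Build one `git svn clone` argument list per git repository to create."""
    if project in PROJECT_LEVEL:
        # Project level: a single repository that contains every module.
        return [[
            "git", "svn", "clone", svn_root,
            "--trunk", f"{project}/trunk",
            "--branches", f"{project}/branches",
            project,
        ]]
    # Module level (the default): one repository per module.
    return [[
        "git", "svn", "clone", svn_root,
        "--trunk", f"{project}/trunk/{module}",
        "--branches", f"{project}/branches/{module}",
        module,
    ] for module in modules]
```

The module list for a project would come from running "svn ls" on its trunk directory and passing the output through parse_svn_ls. Keeping the command construction separate from execution makes it cheap to re-run the whole plan each time the "project level" list changes.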

Local Git to Bitbucket
Atlassian provided a jar (https://go-dvcs.atlassian.com/display/aod/Migrating+from+Subversion+to+Git+on+Bitbucket) to push a git-svn repository up to Bitbucket. The jar can also create an authors file from the subversion repository to map a subversion user to the values git needs for a committer: first name, last name and email address. This made scripting the Bitbucket upload for each repository straightforward. The jar also handles syncs to an existing Bitbucket repository, so developers could continue committing to their svn projects and Bitbucket would automatically get updated. Note that this only does fast-forward syncs, so the incremental sync stops working once commits are made directly to Bitbucket.
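An authors file of the kind git-svn reads has one "login = Full Name <email>" entry per line. As a hedged illustration (this is not part of Atlassian's jar, and the function names are hypothetical), a small check like the following could flag subversion users that the file does not cover before any upload is attempted:

```python
import re
from typing import Dict, Iterable, Set, Tuple

# Matches lines such as: jdoe = Jane Doe <jane@example.com>
AUTHOR_LINE = re.compile(
    r"^\s*(?P<login>[^=]+?)\s*=\s*(?P<name>.+?)\s*<(?P<email>[^>]*)>\s*$"
)


def parse_authors(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each subversion login to its (name, email) pair."""
    authors: Dict[str, Tuple[str, str]] = {}
    for line in text.splitlines():
        match = AUTHOR_LINE.match(line)
        if match:  # blank or malformed lines are skipped
            authors[match["login"]] = (match["name"], match["email"])
    return authors


def missing_authors(svn_logins: Iterable[str],
                    authors: Dict[str, Tuple[str, str]]) -> Set[str]:
    """Return the subversion logins that have no entry in the authors file."""
    return set(svn_logins) - set(authors)
```

The list of subversion logins is assumed to come from elsewhere, for example from the commit log.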

Crucible
Crucible is a tool to facilitate code reviews. It imports commits from your SCM tool, allows inline comments on the diffs and manages the code review life cycle of assigning reviewers, tracking who has approved the changes, and closing the review once approved. Crucible setup is fairly straightforward with a couple of caveats.

Crucible needs to access your repositories to pull in the commit history. There is no native support for pointing Crucible at a Bitbucket team account and having Crucible automatically import each repository. There is a free add-on (https://marketplace.atlassian.com/plugins/com.atlassian.fecru.reposync.reposync) that works for an initial import, but at first it did not bring in new repositories that were added to the team account after the initial import. It turned out the update did not work because I was using a Bitbucket user that could not access the User list from the Bitbucket API. Changing the Bitbucket user to one with access to this API endpoint solved the problem. Incremental updates to the repository list are now working.

While Crucible supports ssh access to git repositories in general, I ran into the problem described here https://answers.atlassian.com/questions/34283/how-to-connect-to-bitbucket-from-fisheye. Basically, Crucible does not support Bitbucket's ssh URL format. Instead of using ssh, I had to use https to connect to the Bitbucket repositories. This means each repository configuration requires the Bitbucket username and password to be specified separately, which is not ideal.

Testing
After running git-svn clone on a few projects, I went ahead and pulled all the projects down with git-svn. The distributed nature of git helped testing because the entire repository could be represented locally without needing to upload it to any server to test the initial clones. However, cloning all the repositories took about 24 hours. During this time there was minimal CPU and I/O load, so I multithreaded the cloning jobs using 16 threads. This cut the time to just 1.5 hours, even on a dual-core machine.
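A thread pool is one simple way to get this kind of parallelism. The sketch below is an assumption about how it could look in Python, not the script that was actually used; run_clone and clone_all are hypothetical names, and each command is assumed to be a "git svn clone" argument list whose last element is the target directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def run_clone(command: Sequence[str]) -> int:
    """Run one clone command and return its exit code."""
    return subprocess.run(list(command), check=False).returncode


def clone_all(commands: List[Sequence[str]],
              runner: Callable[[Sequence[str]], int] = run_clone,
              workers: int = 16) -> Dict[str, int]:
    """Run the commands on a pool of worker threads.

    Returns a mapping from target directory (the last argument of each
    command) to the exit code of that clone.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input commands.
        codes = list(pool.map(runner, commands))
    return {command[-1]: code for command, code in zip(commands, codes)}
```

Threads are enough here because each worker spends its time waiting on a child process, which in turn mostly waits on the subversion server. Any non-zero exit code in the result marks a repository that needs another look.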

I was initially hesitant to upload all the repositories to Bitbucket because I did not want to have to manually delete the repos if there was a problem. However, I found the Bitbucket REST API (https://confluence.atlassian.com/display/BITBUCKET/Use+the+Bitbucket+REST+APIs). It is pretty well put together and was easy to use because it generally follows REST conventions. I've yet to find anything that can be done in the UI that cannot be done in the API, which has been outstanding for adding niceties such as commit hooks that push changes to Crucible for each repository. For the purposes of migration, the best feature was deleting repositories. Knowing I could automatically clean up any mistakes provided the confidence to just let it rip. I actually ended up using this to clean up two false start migrations:
  • git-svn has a "show-ignore" command to translate files ignored by subversion into a .gitignore file. I initially added .gitignore to the git repositories. However, this meant every repository had a commit in Bitbucket and so would no longer accept changes from subversion. This was resolved by adding .gitignore to subversion before the conversion.
  • the first authors file I created was missing a few users. I only discovered this when I noticed that the Bitbucket commit history did not look as nice. It was good to be able to wipe it all out with a single command, fix the authors file, and redo the upload with another.
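For illustration, deleting a repository over REST could look like the sketch below. It targets the repository endpoint of the current Bitbucket Cloud 2.0 API, which is not necessarily the API version used during this migration, and the basic-auth credentials, workspace and repository names are placeholders. The function names are hypothetical.

```python
import base64
import urllib.request
from urllib.parse import quote

API_ROOT = "https://api.bitbucket.org/2.0"


def delete_repo_request(workspace: str, repo_slug: str,
                        username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a DELETE request for one repository."""
    url = (f"{API_ROOT}/repositories/"
           f"{quote(workspace, safe='')}/{quote(repo_slug, safe='')}")
    # Assumption: HTTP basic auth with a username and password-style secret.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url, method="DELETE", headers={"Authorization": f"Basic {token}"}
    )


def delete_repo(workspace: str, repo_slug: str,
                username: str, password: str) -> int:
    """Send the DELETE request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    """
    request = delete_repo_request(workspace, repo_slug, username, password)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Looping delete_repo over the list of migrated repository names is what turns a false start into a quick cleanup rather than an afternoon of clicking through the UI.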

Post-migration
The time following the initial migration was when the automation really came in handy. A couple of developers were out of the office during the cutover. They were able to commit their local work to subversion, and then I could re-sync just those repositories even after other developers had begun working on other repositories in Bitbucket. This went very smoothly, with no hand-wringing or diff patching required to make sure local work was not lost.

Wrap-up

Overall the migration went off with no hiccups. We're still tweaking our preferred settings for git pushes and pulls to get to our ideal workflow, but we're happy to be using Bitbucket. Crucible does not integrate with Bitbucket as nicely as it did with subversion in our old setup. Hopefully Atlassian will continue to make improvements to this integration as we really like the Crucible code review workflow. I'm always impressed how automation begets automation. Once you've taken the step of automating part of the process, it is so much easier to see the next step. We are already seeing some benefits from the time spent interacting with the Bitbucket API as we're now able to add and modify commit hooks on all the repositories easily.