Friday, January 13, 2012

Interactive Hive sessions, Elastic MapReduce, and GNU screen

One extremely annoying aspect of using Hive interactively on EMR (or any other remote system) is that your session dies if you lose your connection to the server. When the connection drops, your ssh session ends, taking your Hive session and any queries that are currently running with it.

In most cases, this happens when I'm waiting for a query to execute and I need to move from one place to another, whether from my desk to a conference room or from the office to home.  When I can predict (or know) that I'm going to lose my connection and just want to be able to reconnect to Hive later, the best option I've found is to run Hive inside of GNU screen.

I'm definitely a screen newbie, but there are really only three things you need to know:

1. As soon as you log in for the first time, install screen, start it up, and launch Hive inside it:

sudo apt-get install screen
screen
hive

2. When you're (temporarily) done interacting with Hive and want to stick it in the background, tell screen to detach from your terminal by pressing "Ctrl-a" then "d". Hive keeps running inside the detached screen session, so you may now log out from your EMR node.

3. When you're ready to resume your Hive session, simply log back on and tell screen to reattach to your detached session (if more than one session is detached, screen lists them and you pass the one you want to -r):

screen -r
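
If you keep more than one screen session around, or your connection dropped before you had a chance to detach, naming the session can make reattaching less ambiguous. Here is a minimal sketch of that idea. The session label "hive" and the three helper function names are my own illustrative choices, not anything screen or EMR provides:

```shell
# Hypothetical helpers (e.g. for ~/.bashrc); "hive" is just an arbitrary session label.

# Start a new screen session named "hive"; run `hive` once you are inside it.
hive_start() { screen -S hive; }

# List all screen sessions and whether each is attached or detached.
hive_list() { screen -ls; }

# Reattach to the "hive" session, detaching it first if a stale
# connection still holds it.
hive_resume() { screen -d -r hive; }
```

After a dropped connection, I would expect hive_list followed by hive_resume to get you back to the Hive prompt, but treat this as a starting point rather than a tested recipe for every EMR image.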

Screen does a whole lot of other stuff, but simply allowing graceful reconnection to Hive sessions is definitely worth the price of entry.

For comparison, some other ways to work around this problem are using nohup, suspending the job, putting it in the background and using disown, or using an even more advanced tool like tmux.
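
For a query that doesn't need an interactive prompt, the nohup route might look something like the sketch below. The function name run_detached and the file names in the usage comment are hypothetical. It also assumes the query lives in a script file that you run with hive -f:

```shell
# Hypothetical helper: run a command immune to hangups, with all output sent to a log.
# Usage (file names are placeholders): run_detached my_query.log hive -f my_query.hql
run_detached() {
    log=$1
    shift
    # nohup keeps the command alive after the ssh session ends;
    # stdout and stderr both go to the log file.
    nohup "$@" > "$log" 2>&1 &
    # Print the background process ID so it can be checked on later.
    echo "$!"
}
```

Unlike screen, this gives you nothing to reattach to. You only get the log file, so it suits fire-and-forget queries better than exploratory sessions.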

2 comments:

Pat Gannon said...

Can you retrieve the output that happens in hive when the screen is detached? Does it show it all when you reconnect?

Alex Boisvert said...

Yes, some more information on the scroll-back buffer here:
http://www.samsarin.com/blog/2007/03/11/gnu-screen-working-with-the-scrollback-buffer/