Wednesday, September 29, 2010

emr: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory

Moving one of our jobs from hive 0.4 / hadoop 0.18 to hive 0.5 / hadoop 0.20 on amazon emr, I ran into a weird error in the reduce stage, something like:


java.io.IOException: Task: attempt_201007141555_0001_r_000009_0 - The reduce copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:384)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at org.apache.hadoop.mapred.MapOutputFile.getInputFileForWrite(MapOutputFile.java:160)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2622)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2586)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
at java.lang.ProcessImpl.start(ProcessImpl.java:65)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
... 8 more


There's some discussion on this thread in the emr forums.

From Andrew's response to the thread:


The issue here is that when Java tries to fork a process (in this case bash), Linux allocates as much memory as the current Java process, even though the command you are running might use very little memory. When you have a large process on a machine that is low on memory this fork can fail because it is unable to allocate that memory.
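
This failure mode is reproducible outside of hadoop. Here's a minimal sketch (everything about it is an assumption for illustration: run it with a heap close to physical memory, e.g. -Xmx6g on an 8 GB box, with strict overcommit accounting enabled via vm.overcommit_memory=2):

// ForkDemo.java: a fat JVM forking a tiny child process.
// The fork(2) behind ProcessBuilder.start() momentarily needs as much
// committed memory as the whole parent JVM, so under strict overcommit
// it can fail with error=12 (ENOMEM) even though "true" needs almost nothing.
public class ForkDemo {
    public static void main(String[] args) throws Exception {
        // fill most of the heap so the fork has little left to reserve
        byte[][] hog = new byte[5][];
        for (int i = 0; i < hog.length; i++) {
            hog[i] = new byte[1 << 30]; // ~1 GB each
        }
        // this is the same kind of call hadoop's Shell/DF helpers make internally
        Process p = new ProcessBuilder("bash", "-c", "true").start();
        System.out.println("exit: " + p.waitFor());
    }
}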


The workaround here is to either use an instance with more memory (m2 class), or reduce the number of mappers or reducers you are running on each machine to free up some memory.
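
For reference, the two TaskTracker properties that control per-machine task slots are (the reduce-side one isn't touched below, it's just the other half of the knob):

mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum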

Since the task I was running was reduce-heavy, I chose to just drop the number of mappers from 4 to 2. You can do this pretty easily with the emr bootstrap actions.

My job ended up looking something like this:


elastic-mapreduce --create --name "awesome script" \
--num-instances 8 --instance-type m1.large \
--hadoop-version 0.20 \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-s,mapred.tasktracker.map.tasks.maximum=2" \
--hive-script --arg s3://....../script

(the bootstrap-action lines above are the relevant part).


Tuesday, September 21, 2010

Salesforce and DART Synchronization

I’ve recently started some work that involves extending Salesforce for our Ad Ops team. For our most recent Hack Day, I decided to do a little project to continue learning about development with the Salesforce cloud platform, Force.com.

After thinking about what I wanted to work on, I decided to build a custom button that would allow a user to update an Account record in Salesforce with an Advertiser ID from DART, our primary ad serving platform, for the following reasons:
  1. It’s a tool that I could see being used in our live Salesforce instance.
  2. It seems like a typical use case for extending Salesforce (i.e. integrating with a 3rd party SOAP service).
The back of the napkin design looked like this:



At a high-level, I wanted to call DART’s DFP API from within Salesforce and then update an Account object in Salesforce with the Advertiser Id returned from DART. However, I first needed to authenticate with Google’s ClientLogin service in order to get an authentication token for calling the DFP API.
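
The ClientLogin exchange itself is just a form POST that returns a handful of key=value lines, the last of which carries the token (values abbreviated, email made up):

POST https://www.google.com/accounts/ClientLogin
Content-Type: application/x-www-form-urlencoded

accountType=GOOGLE&Email=someone@example.com&Passwd=...&service=gam

SID=DQAAAGgA...
LSID=DQAAAGsA...
Auth=DQAAAGkA...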

APEX

APEX is the programming language that allows a developer to customize a Salesforce installation. APEX’s syntax, not surprisingly, is very similar to Java. The really interesting thing is that none of the code you write actually compiles or runs on your machine. All compilation and execution happen “in the cloud”.

DART Integration

Salesforce has a strict security model. In order to make a request to a Web Service you actually need to configure any URLs you are accessing as a Remote Site. Instructions for doing this can be found here. For this project, I simply needed to add https://www.google.com as a Remote Site.

There are a couple of options for calling a Web Service via APEX:
  • Use the Http/HttpRequest/Http APEX classes. These are useful for calling REST style services.
  • Import a WSDL and use the generated code to make a SOAP request.
In this project, I ended up using both methods.

Here is the APEX code I developed for calling Google’s ClientLogin authentication service:
public class GoogleAuthIntegration {
    private static String CLIENT_AUTH_URL = 'https://www.google.com/accounts/ClientLogin';

    // login to google with the given email and password
    public static String performClientLogin(final String email, final String password) {
        final Http http = new Http();
        final HttpRequest request = new HttpRequest();
        request.setEndpoint(CLIENT_AUTH_URL);
        request.setMethod('POST');
        request.setHeader('Content-type', 'application/x-www-form-urlencoded');

        // url-encode the credentials so characters like '&' or '+' don't break the form body
        final String body = 'service=gam&accountType=GOOGLE&'
            + 'Email=' + EncodingUtil.urlEncode(email, 'UTF-8')
            + '&Passwd=' + EncodingUtil.urlEncode(password, 'UTF-8');
        request.setBody(body);

        final HttpResponse response = http.send(request);
        final String responseBody = response.getBody();

        // the response is a set of key=value lines (SID, LSID, Auth); grab everything after 'Auth='
        final String authToken = responseBody.substring(responseBody.indexOf('Auth=') + 5).trim();

        System.debug('authToken is: ' + authToken);
        return authToken;
    }
}
This piece of code would fetch an authToken for the given username and password. Once I had the authToken, I could then call the DFP API. For this part, I used WSDL/SOAP, the second method for calling web services.

Salesforce provides a way to import a WSDL file via its Admin UI. It then parses and generates APEX code that allows you to call methods exposed by the WSDL. However, when I tried importing DFP’s Company Service WSDL, I ran into some errors:



It turns out that the WSDL contains an element named ‘trigger’ and trigger is a reserved APEX keyword. In any event, I ended up copy/pasting the generated code and fixing it so that it compiled correctly (I also ran into a problem where generated exception classes were not extending Exception).
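
The fixes amounted to edits along these lines (a hypothetical sketch, not the actual generated stubs; 'Creative' and the member names are made up for illustration):

public class DartCompanyService {
    public class Creative {
        // the WSDL element is literally named 'trigger'; since that's a
        // reserved word in APEX, rename the Apex field by hand (the wire
        // name 'trigger' stays in the generated type-info arrays, omitted here)
        public String trigger_x;
    }
    // the generated exception classes wouldn't compile until they extended Exception
    public class ApiException extends Exception {}
}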

Once the code to call the DFP Company Service was compiling, I created an APEX controller to perform the update on an Account record.
public class SyncDartAccountController {
    private final Account acct;

    public SyncDartAccountController(ApexPages.StandardController stdController) {
        this.acct = (Account) stdController.getRecord();
    }

    // Code we will invoke on page load.
    public PageReference onLoad() {
        String theId = ApexPages.currentPage().getParameters().get('id');

        if (theId == null) {
            // Display the Visualforce page's content if no Id is passed over
            return null;
        }

        // get authToken for DFP API requests
        String authToken = GoogleAuthIntegration.performClientLogin('xxx@xxx.com', 'xxxx');

        // get Account with the given id
        for (Account o : [select id, name from Account where id = :theId]) {
            DartCompanyService.CompanyServiceInterfacePort p = new DartCompanyService.CompanyServiceInterfacePort();
            p.RequestHeader = new DartCompanyService.SoapRequestHeader();
            p.RequestHeader.applicationName = 'sampleapp';

            // prepare the DFP query and execute; escape quotes so an account
            // name containing an apostrophe doesn't break the statement
            DartCompanyService.Statement filterByNameAndType = new DartCompanyService.Statement();
            filterByNameAndType.query = 'WHERE name = \'' + String.escapeSingleQuotes(o.Name) + '\' and type = \'ADVERTISER\'';

            DartCompanyService.CompanyPage page = p.getCompaniesByStatement(filterByNameAndType);

            if (page.totalResultSetSize > 0) {
                // update the record if we get a result
                o.Dart_Advertiser_Id__c = page.results.get(0).id;
                update o;
            }
        }

        // Redirect the user back to the original page
        PageReference pageRef = new PageReference('/' + theId);
        pageRef.setRedirect(true);
        return pageRef;
    }
}
UI updates

Then, I created a simple Visualforce page to invoke the controller:

<apex:page standardController="Account" extensions="SyncDartAccountController" action="{!onLoad}">
    <apex:sectionHeader title="Auto-Running Apex Code"/>
    <apex:outputPanel>
        You tried calling Apex Code from a button. If you see this page, something went wrong.
        You should have been redirected back to the record you clicked the button from.
    </apex:outputPanel>
</apex:page>

Finally, I added a custom button to the Account page which would invoke the Visualforce page. You can do this in the Salesforce UI:

1) Click on ‘Buttons and Links’:



2) Click New:


3) Enter the info for the new button:



4) After clicking on Save, we can add the button to the Account page layout. The final result:

Final Thoughts

This was my first foray into APEX programming in Salesforce, and I was pleased with the overall set of tools and how quickly I could be productive. The only hiccup I encountered was in the WSDL generation step, and it was fairly easy to overcome. There are good developer docs, there are ways to add debug logging (which I didn't go over), and there's a framework for unit testing.

Monday, September 20, 2010

quick script: emr-mailer

We write a lot of hive reports. Frequently we want to email the resulting report to a list. In the past I've usually done this with some one-off post processing scripts, but I thought it would be nice to write a reusable emr job step that will execute as part of the hive job.

The script will download files from an s3 url, concatenate them together, zip up the result, and send it as an attachment to a specified email address. It sends email through smtp.gmail.com, using account credentials you specify.
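
Roughly, it boils down to something like this (a shell sketch of the idea, not the actual ruby script; bucket names and paths are made up):

# pull the report files down from s3
s3cmd get --recursive s3://mybucket/reports/my-report/ ./report/
# concatenate the part files and zip the result
cat ./report/* > report.csv
zip report.zip report.csv
# then open an smtp connection and send report.zip as an attachment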

I wanted to make it easy to just append an additional step to any existing job, without requiring any additional machine setup or dependencies. I was able to do this by making use of amazon's script-runner (s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar). The script-runner.jar step lets you execute an arbitrary script from a location in s3 as an emr job step.
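
On its own, a script-runner step just takes the s3 location of the script (plus any arguments to pass along), so tacking one onto an existing job flow looks something like this (job flow id and script path made up):

elastic-mapreduce --jobflow j-XXXXXXXXXXXX \
--jar s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar \
--args s3://mybucket/scripts/my-script.sh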

As I mentioned, the intended usage is to run it as a job step with your hive script, passing it in the location of the resulting report.

E.g.:


elastic-mapreduce --create --name "my awesome report ${MONTH}" \
--num-instances 10 --instance-type c1.medium --hadoop-version 0.20 \
--hive-script --arg s3://path/to/hive/script.sql \
--args -d,MONTH=${MONTH} --args -d,START=${START} --args -d,END=${END} \
--jar s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar \
--args s3://path/to/emr-mailer/send-report.rb \
--args -n,report_${MONTH} --args -s,"my awesome report ${MONTH}" \
--args -e,awesome-reports@company.com \
--args -r,s3://path/to/report/results


Above you can see I'm starting a hive report as normal, then simply appending the script-runner step, which calls emr-mailer's send-report.rb, telling it where the report will end up and giving it the details for the email.


The full source code is available on github as emr-mailer.

The script is pretty simple, but let me know if you have any suggestions for improvements or other feedback.