I think we've hinted at this before, but here at bizo, the only machines we have are our laptops. Everything else is at amazon. All of our servers are ec2 instances and all of our data is in s3.
In some of our larger hadoop jobs, we're processing hundreds of gigs worth of input data, and it literally takes a couple of hours to copy data from s3 to a local machine, and then to hdfs. And once the job is done, it's back to s3. Additionally, we're starting and stopping clusters on demand, per job, so we're really not getting any data locality benefit from hdfs.
How do we speed this up... Hadoop comes with a tool, distcp, that is basically a Map/Reduce job to copy files to hdfs, so there's some benefit there to distribute the copy.
It would be great to skip all this copying and just use s3 directly. All of our input is already there, and that's where we want our output. And once the data's on s3, it's easy to move around, copy, rename, etc.
Hadoop does include s3 support, with 2 filesystems. The s3 Block Filesystem (s3), and the The s3 Native Filesystem (s3n).
The S3 Block Filesystem is basically HDFS, using s3 as the datastore. Files are stored as blocks directly into s3. This means that you need to dedicate a bucket to the filesystem, and can't use any s3 tools to read or write data. I believe the filesystem was designed this way to overcome the s3 object size limit, as well as to support renames (the rename support mentioned above wasn't available in the original s3 api). I don't quite get the point of this filesystem now, though. There doesn't seem to be much benefit to using s3, except I guess you don't need the NameNode and DataNode processes up and running.
The S3 Native Filesystem is what you want -- you can use real s3 objects directly, meaning, use the amazon apis directly, and any s3 tools to work with your job inputs and outputs. This is great!
Unfortunately, it's not quite here yet as a replacement for hdfs. This isn't totally clear... In fact, in the config the example says just replace "s3" with "s3n" in your config. So, I spent some time today trying to get it working. It worked great for my map tasks, but at reduce, I kept getting:
08/11/19 22:19:09 INFO mapred.JobClient: Task Id : attempt_200811192212_0002_r_000000_0, Status : FAILED
Failed to rename output with the exception: java.io.IOException: Not supported
I figured why would there be a s3n filesystem if you couldn't use it? The docs mention that rename isn't supported, but it's not clear that this means you can't use it as a hdfs replacement. Searching on this got me nothing, but after reading HADOOP-336 and other docs more closely it became clear this wasn't going to work.
Anyway, I thought I'd throw this post together and update the wiki to hopefully save someone else some time.
It looks like Support for renaming will be added in 0.19.0, which should then allow you to go all S3(n)! with you hadoop jobs.