EMR HDFS transparently backed by S3

With Hadoop I can use S3 as a storage URL. But currently I have a lot of applications using hdfs://..., and I would like to migrate the whole cluster and its apps to EMR and S3. Do I have to change the URL in every single app from hdfs://... to s3://..., or is it possible to somehow tell EMR to store HDFS content on S3, so that each application can still use hdfs://... but in fact point to S3? If so, how?

Answers


That's a very good question. Is there such a thing as protocol spoofing? Could you actually affect this behavior by writing something that overrides how protocols are handled? Honestly, that kind of solution gives me the heebie-jeebies: if someone doesn't know it's happening, gets unexpected paths, and can't diagnose or fix them, that's worse than the original problem.

If I were you, I'd do a find-and-replace over all my apps to just update the protocol.

Let's say you had all of your apps in a directory:

-- myApps
  |-- app1.txt
  |-- app2.txt

and you wanted to find and replace hdfs:// with s3:// in all of those apps, I'd just do something like this:

sed -i.original 's|hdfs://|s3://|g' *

which produces:

-- myApps
  |-- app1.txt
  |-- app1.txt.original
  |-- app2.txt
  |-- app2.txt.original

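and now app1.txt has s3:// everywhere rather than hdfs://. (Anything after the scheme is left untouched, so a namenode host and port such as hdfs://namenode:8020/... would still need to be changed to a bucket name.)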

Isn't that enough?
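Before running an in-place replacement like the one above, it may help to see which files would be touched. The following is only a sketch: the function name list_hdfs_refs is made up for illustration, and it assumes the apps are plain-text files under a single directory such as myApps.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any tool): list the files under a
# directory that still contain an hdfs:// URL.
list_hdfs_refs() {
    local dir="${1:?Missing directory}"
    # -r: recurse into subdirectories
    # -l: print only the names of files that contain a match
    # sort: keep the output order stable
    grep -rl 'hdfs://' "$dir" | sort
}

# Example (assumes a directory named myApps exists):
#   list_hdfs_refs myApps
```

Run after the replacement, the same check would also list the *.original backups, since those still contain the old URLs.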


The applications should be refactored so that the input and output paths are not hard-coded. Instead, the paths should be injected into the applications, after being read from configuration files or parsed from command-line arguments.

Take the following Pig script for example:

loaded_records =
    LOAD '$input'
    USING PigStorage();
--
-- ... magic processing ...
--
STORE processed_records
    INTO '$output'
    USING PigStorage();

We can then have a wrapper script like this:

#!/usr/bin/env bash
config_file=${1:?"Missing config_file"}

[[ -f "$config_file" ]] && source "$config_file" || { echo "Failed to source config file $config_file"; exit 1; }

pig -p input="${input_root:?'Missing parameter input_root in config_file'}/my_input_path" -p output="${output_root:?'Missing parameter output_root in config_file'}/my_output_path" the_pig_script.pig

In the config file:

input_root="s3://mybucket/input"
output_root="s3://mybucket/output"

With this kind of setup, switching between HDFS and S3 only requires a configuration change.
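To illustrate that only the root differs between the two storage back ends, here is a small sketch. The helper build_path and the HDFS root (including the namenode host and port) are assumptions made up for this example; only the s3://mybucket/input value comes from the config file above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join a storage root and a relative path.
build_path() {
    local root="${1:?Missing root}"
    local relative="${2:?Missing relative path}"
    # Drop one trailing slash from the root and one leading slash from the
    # relative path, so the two are joined by exactly one slash.
    printf '%s/%s\n' "${root%/}" "${relative#/}"
}

# Assumed HDFS root (host and port are placeholders, not from the original)
hdfs_input_root="hdfs://namenode:8020/input"
# S3 root, as in the config file above
s3_input_root="s3://mybucket/input"

# The application-side relative path stays the same; only the root changes.
build_path "$hdfs_input_root" "my_input_path"
build_path "$s3_input_root" "my_input_path"
```

The same idea applies to output_root: one config file per environment, with the scripts themselves unchanged.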

