Hadoop File Upload Process Inner Workings
I'm currently stuck on a problem: I can upload files to HDFS when I run the client from any node inside the cluster, but not when I run the client from my local computer (even though operations like ls work fine from my local client). I'm pretty sure this is a ports issue, but it got me thinking that I'd like to understand exactly what communication happens between my client computer, the namenode, and the datanodes when I upload a file. So, can anyone enlighten me? What exactly happens when, over which ports, and between which computers?
This turned out to be an EC2 issue: the namenode returned the datanodes' EC2 private IPs to all clients, regardless of whether they were in EC2 or on our private network. Those IPs obviously don't work for clients outside of EC2, so any operation that involved a datanode and came from outside EC2 would fail. That also explains why ls worked from my local machine: listing only talks to the namenode, whereas an upload has the client write blocks directly to the datanodes. I never found a good solution for this and just decided to make people query from inside EC2 for now.
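For anyone trying to confirm the same symptom, here is a minimal sketch (not part of Hadoop) that checks from the client machine whether the namenode and the datanode addresses are reachable over TCP. The hostnames and IPs are hypothetical placeholders, and the ports are assumptions: 8020 is a common namenode RPC port, and 50010 is the default datanode data-transfer port in Hadoop 1.x/2.x (9866 in 3.x). Check fs.defaultFS and dfs.datanode.address in your own configuration before relying on them.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each (host, port) pair to whether it is reachable from this machine."""
    return {(host, port): can_connect(host, port, timeout) for host, port in endpoints}


if __name__ == "__main__":
    # Hypothetical addresses: replace with your namenode and with the
    # datanode addresses the namenode actually hands out to clients.
    endpoints = [
        ("namenode.example.com", 8020),  # namenode RPC (assumed port)
        ("10.0.0.12", 50010),            # datanode data transfer (assumed port)
    ]
    for (host, port), ok in check_endpoints(endpoints).items():
        print(f"{host}:{port} -> {'reachable' if ok else 'NOT reachable'}")
```

If the namenode is reachable but the datanode addresses are not, you would expect exactly what I saw: metadata operations such as ls succeed while uploads fail.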