How is data distributed in Azure HDInsight during processing?
One of the selling points of Hadoop is that the data sits with the compute. How does that work with WASB? When a MapReduce job runs, the map and reduce tasks are executed on the nodes where the blocks of data reside, which is how data locality is achieved. In HDInsight, however, the data is stored in WASB. So when a MapReduce job is executed, is the data copied from WASB to each of the compute nodes before processing proceeds? If so, the single channel used to copy data to the compute nodes will be a bottleneck.
Can anyone explain how data is stored in WASB and how it is handled during processing?
Just as with any Hadoop system, the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from Azure storage accounts instead of from local disks. Given the way Azure data center backbones are built, performance is generally the same as with disks locally attached to the VMs.
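To make the "single channel" worry concrete, here is a minimal sketch, assuming that (as with other Hadoop file systems) the job input is divided into byte-range splits and each map task reads only its own range from the storage account. The names `Split` and `compute_splits` and the sizes used are hypothetical, chosen for illustration; the real split logic depends on the input format and cluster configuration and is more involved than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Split:
    """A byte range of one blob that a single map task would read (hypothetical type)."""
    offset: int
    length: int


def compute_splits(blob_size: int, split_size: int) -> List[Split]:
    """Divide a blob into contiguous ranges of at most split_size bytes.

    Simplified illustration only: it ignores record boundaries and any
    tuning that a real input format applies to the last split.
    """
    if blob_size < 0 or split_size <= 0:
        raise ValueError("blob_size must be >= 0 and split_size must be > 0")
    splits = []
    offset = 0
    while offset < blob_size:
        length = min(split_size, blob_size - offset)
        splits.append(Split(offset, length))
        offset += length
    return splits


MB = 1024 * 1024

# Assumed sizes for illustration: a 1200 MB blob and a 512 MB split size.
for task_id, s in enumerate(compute_splits(1200 * MB, 512 * MB)):
    # Each task would issue its own ranged read for just these bytes.
    print(f"map task {task_id}: bytes {s.offset}..{s.offset + s.length - 1}")
```

Under these assumptions, the tasks fetch their ranges independently and in parallel from whichever node they run on, so the data is not funneled through one copy step before processing starts.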
HDInsight clusters can be located in any of Azure's regions. The storage accounts that a cluster reads from must be in the same region as the cluster, to avoid high latency. Azure has done a lot of work on its data centers so that performance is comparable.
If you want to learn more, Ashish's quote comes from this article: https://blogs.msdn.microsoft.com/cindygross/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure/