Show how a parquet file is replicated and stored on HDFS

Data stored in Parquet format ends up as a directory containing many small files on HDFS.

Is there a way to see how those files are replicated in HDFS, i.e. which nodes each file's blocks live on?

Thanks in advance.

Answers


If I understand your question correctly, you want to track which data blocks are on which data nodes, and that's not Apache Spark-specific.

You can use the hadoop fsck command as follows:

hadoop fsck <path> -files -blocks -locations

This will print out locations for every block in the specified path.
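If you'd rather query block placement programmatically instead of shelling out to fsck (for example from the same JVM as your Spark job), here is a minimal sketch using the Hadoop FileSystem API; the class name and argument handling are just for illustration, and the path to the Parquet output directory is assumed to be passed as the first argument:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS etc. from the cluster config on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The Parquet "file" is really a directory of part files
        Path dir = new Path(args[0]);
        for (FileStatus status : fs.listStatus(dir)) {
            if (!status.isFile()) continue;

            // One BlockLocation per HDFS block; getHosts() lists the
            // data nodes holding a replica of that block
            BlockLocation[] locations =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            System.out.println(status.getPath());
            for (BlockLocation loc : locations) {
                System.out.printf("  offset=%d length=%d hosts=%s%n",
                        loc.getOffset(), loc.getLength(),
                        String.join(",", loc.getHosts()));
            }
        }
    }
}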

