Show how a parquet file is replicated and stored on HDFS
Writing data in Parquet format results in a directory containing many small part files on HDFS.
Is there a way to view how those files are replicated in HDFS (on which nodes)?
Thanks in advance.
If I understand your question correctly, you want to track which data blocks are stored on which data nodes, and that is not Apache Spark specific.
You can use the hadoop fsck command as follows:
hadoop fsck <path> -files -blocks -locations
This will print out locations for every block in the specified path.
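As a concrete sketch, you would point fsck at the directory that Spark wrote (the path below is hypothetical; substitute your own Parquet output directory):

```shell
# Hypothetical path to a Parquet dataset -- replace with your own directory.
hadoop fsck /user/me/output.parquet -files -blocks -locations

# -files     : lists each file under the given path
# -blocks    : lists the blocks that make up each file
# -locations : prints the datanode address(es) holding each block replica
```

On recent Hadoop versions the same check is also exposed as `hdfs fsck`, which takes the same options.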