How to check the overall progress of a Pig job
A Pig script can be translated into multiple MR jobs, and I am wondering whether there is an interface or some other way to see the overall progress of the script, for example how many jobs are scheduled, how many have executed, and so on.
There is an illustrate command, but it throws an exception on my deployment, so I use another approach.
You can find out how many MR jobs are scheduled by using the explain command and looking at the Map Reduce Plan section, which is at the end of the explain report. To get the number of MR jobs for the script I do the following:
./pig -e 'explain -script ./script_name.pig' > ./explain.txt
grep MapReduce ./explain.txt | wc -l
Now we have the number of MR jobs planned. To monitor the script's execution, open Hadoop's jobtracker page (http://(IP_or_node_name):50030/jobtracker.jsp) before you run it and write down the name of the last job in the Completed Jobs section. Then submit the script, refresh the jobtracker page, and count how many jobs are running and how many have completed after the one you noted. That tells you roughly how many jobs are left to execute. Click on an individual job to see its statistics and progress.
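If you would rather not count by hand, the same bookkeeping can be sketched in shell. This is only a rough illustration, not something from the original answer: summarize_jobs is a made-up helper name, and it assumes a Hadoop 1.x style listing from "hadoop job -list all", where each job row starts with the job id and the second column is a numeric state (1 = running, 2 = succeeded). Check that against what your cluster actually prints.

```shell
#!/bin/sh
# Sketch only. Assumed input: rows like
#   job_<jobtracker-start>_<seq> <state> ...
# as printed by `hadoop job -list all` on Hadoop 1.x (an assumption to verify).
# Ids are compared as strings, so this only works for jobs from the same
# jobtracker start whose sequence numbers have the same width.

# summarize_jobs LAST_JOB_ID PLANNED   (hypothetical helper; listing on stdin)
# LAST_JOB_ID: the job you noted before submitting the script.
# PLANNED:     the number of MR jobs reported by explain.
summarize_jobs() {
    awk -v last="$1" -v planned="$2" '
        /^job_/ && $1 > last {
            if ($2 == 1) running++
            else if ($2 == 2) finished++
        }
        END {
            printf "running=%d completed=%d remaining=%d\n",
                   running, finished, planned - finished
        }'
}

# Usage (the job id and the count are placeholders):
#   hadoop job -list all | summarize_jobs job_201301091234_0042 7
```

Note that "remaining" here includes the jobs that are currently running, and that other users' jobs submitted in the meantime would be counted too.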
A much simpler approach is to run the script on a small dataset and note down the number of jobs, which is displayed in the console output after the script finishes. As Pig does not change its execution plan, the number will be the same for the big dataset. By looking at the stats of each job on Hadoop's jobtracker page (http://(IP_or_node_name):50030/jobtracker.jsp) you can get an idea of the proportion of time each MR job takes. Then you can use it to roughly extrapolate the execution time on the large dataset. If you have skewed data or Cartesian products, predicting the execution time may become tricky.
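As a rough illustration of that extrapolation, here is a small shell sketch. The helper name estimate_runtime and its input format are invented for this example, and it assumes that runtime grows linearly with input size, which is exactly the assumption that skew or Cartesian products will break.

```shell
#!/bin/sh
# Sketch only. Reads one "job_name seconds" pair per line on stdin (timings
# noted from the small run) and scales them by SCALE, the ratio of the large
# input size to the small one. Linear scaling is an assumption.

# estimate_runtime SCALE   (hypothetical helper)
estimate_runtime() {
    awk -v scale="$1" '
        { name[NR] = $1; secs[NR] = $2; total += $2 }
        END {
            if (total <= 0) { print "no timing data"; exit 1 }
            for (i = 1; i <= NR; i++)
                printf "%s share=%.0f%% estimate=%ds\n",
                       name[i], 100 * secs[i] / total, secs[i] * scale
            printf "total estimate=%ds\n", total * scale
        }'
}

# Usage (numbers are made up):
#   printf 'job1 30\njob2 90\n' | estimate_runtime 50
```

The per-job share is often more useful than the total: it shows which MR job is likely to dominate the run on the large dataset.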
We had the same problem at Twitter, as some of our Pig scripts spin up dozens of Map-Reduce jobs and it's sometimes hard to tell which of them is doing what, reason about the efficiency of the plan, understand how many will run in parallel, etc.
So we created Twitter Ambrose: https://github.com/twitter/ambrose
It spins up a little Jetty server that gives you a nice web UI showing the job DAG. It colors the nodes as the jobs complete, gives you stats about the jobs, and tells you which relations each job is trying to calculate.