MapReduce workflow benchmarks
Has anybody come across any benchmarks for testing MapReduce workflows? Or, more generally, Big Data workflow benchmarks for testing the performance and accuracy of workflow systems like Oozie?
Probably the best-known MapReduce benchmark is Terasort. It takes a large number of randomly generated records and sorts the entire data set, which simulates a real large-scale MapReduce job that exercises both mappers and reducers. It ships in the Hadoop MapReduce examples jar, so you don't have to install it separately.
The first step is to generate the input data with Teragen, using the MapReduce examples jar in your MapReduce lib directory:
hadoop jar hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>
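Since Teragen takes a row count rather than a size, it can help to work out the count from the amount of data you want. A minimal sketch, assuming decimal gigabytes (10^9 bytes); rows_for_gb is a hypothetical helper name, not part of Hadoop:

```shell
# Hypothetical helper: convert a target size in decimal gigabytes (10^9 bytes)
# into the number of 100-byte rows that Teragen expects.
rows_for_gb() {
  local gb=$1
  echo $(( gb * 1000000000 / 100 ))
}

# 1000 GB (1 TB) of input is 10 billion rows.
rows_for_gb 1000
```

The printed value can be passed as the first Teragen argument; for a quick smoke test on a small cluster, a few gigabytes is usually enough.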
The second step is to run Terasort on the generated input data. The time this step takes is the result of the benchmark:
hadoop jar hadoop-*examples*.jar terasort <input dir> <output dir>
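To capture that time in a repeatable way, you could wrap the command in a small timing function. This is only a sketch: time_cmd is a hypothetical helper, the directories in the usage comment are placeholders, and it measures wall-clock time in whole seconds as seen from the client.

```shell
# Hypothetical helper: run any command, report its wall-clock duration
# in whole seconds on stderr, and pass the command's exit status through.
time_cmd() {
  local start end status
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "elapsed_seconds=$(( end - start ))" >&2
  return $status
}

# Usage (placeholder directories, adjust to your cluster):
# time_cmd hadoop jar hadoop-*examples*.jar terasort /bench/tera-in /bench/tera-out
```

The shell's built-in time keyword works as well; the function form just makes it easier to log the number alongside the configuration being tested.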
Optionally, a third step is to validate that the output results are correct using Teravalidate:
hadoop jar hadoop-*examples*.jar teravalidate <terasort output dir (= input data)> <teravalidate output dir>
It can be very difficult to compare the timings of this benchmark from one cluster to another, but it can be useful for comparing across changes within the same cluster, such as modifying configurations, or adding new nodes.
There is an in-depth description of Terasort in this blog entry.