Hadoop MR source: HDFS vs HBase. Benefits of each?
If I understand the Hadoop ecosystem correctly, I can run my MapReduce jobs sourcing data from either HDFS or HBase. Assuming that's correct, why would I choose one over the other? Is there a performance, reliability, cost, or ease-of-use benefit to using HBase as a MapReduce source?
The best I've been able to find is this quote, "HBase is the Hadoop application to use when you require real-time read/write random-access to very large datasets." - Tom White (2009) Hadoop: The Definitive Guide, 1st Edition
Using straight-up Hadoop Map/Reduce over HDFS, your inputs and outputs are typically stored as flat text files or Hadoop SequenceFiles, which are simply serialized objects streamed to disk. These data stores are more or less immutable. This makes Hadoop suitable for batch processing tasks.
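To make the access pattern concrete, here is a minimal sketch in plain Python (not the Hadoop API; the file layout and names are illustrative only) of that storage model: records are serialized, appended to a flat file, and the file is then treated as immutable and read start-to-finish by a batch job.

```python
import json
import os
import tempfile

# Sketch of an immutable flat-file record store, as used for MR inputs/outputs.
# Hypothetical helpers; a real job would use TextInputFormat or SequenceFiles.

def write_records(path, records):
    with open(path, "w") as f:
        for rec in records:          # append-only: no record is updated in place
            f.write(json.dumps(rec) + "\n")

def scan_records(path):
    with open(path) as f:            # a batch job streams the whole file
        return [json.loads(line) for line in f]

path = os.path.join(tempfile.mkdtemp(), "part-00000")
write_records(path, [{"k": "a", "v": 1}, {"k": "b", "v": 2}])
print(scan_records(path))            # the only access pattern: full sequential scan
```

The point is what's missing: there is no way to look up or modify a single record without rewriting or rescanning the whole file, which is exactly why this layout suits batch jobs and nothing else.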
HBase is a full-fledged database (albeit not relational) which uses HDFS as storage. This means you can run interactive queries and updates on your dataset.
What's nice about HBase is that it plays nicely with the Hadoop ecosystem, so if you have the need to perform batch processing as well as interactive, granular, record-level operations on huge datasets, HBase will do both well.
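By contrast, here is a toy sketch (plain Python, not the HBase client API) of the access patterns HBase layers on top of HDFS: random record-level reads and writes by row key, plus sorted range scans that batch jobs can still consume.

```python
# Toy model of an HBase-style table: rows addressed by key, kept in key order.
table = {}                                   # row key -> row data

def put(row_key, value):                     # record-level write/update, any time
    table[row_key] = value

def get(row_key):                            # random read of a single row
    return table.get(row_key)

def scan(start, stop):                       # rows come back sorted by row key,
    return [(k, table[k]) for k in sorted(table) if start <= k < stop]

put("row1", {"cf:col": "a"})
put("row2", {"cf:col": "b"})
put("row1", {"cf:col": "updated"})           # in-place overwrite: not possible with a flat HDFS file
print(get("row1"))
print(scan("row1", "row3"))
```

The sorted range scan is what lets MapReduce jobs still stream over an HBase table efficiently, while `get`/`put` serve the interactive, record-level side.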
Some relevant limitations of HDFS (an open-source counterpart of the Google File System) are described in the original Google File System paper.
About the target use cases, we read:
Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. [...]
[...] Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, [...]
As a result:
[...] we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them.
A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, [...]
If I read the paper correctly, the several replicas of each file (in the HDFS sense) will not necessarily be byte-for-byte identical. If the clients only use the atomic append operation, each file can be viewed as a concatenation of records (one per operation), but a given record may appear more than once in some replicas, and the order of records may differ from replica to replica. (Apparently some padding may also be inserted, so it's not even as clean as that; read the paper for details.) It's up to the application to manage the record boundaries, unique identifiers, checksums, etc.
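A small simulation (plain Python, not GFS/HDFS code) of that consistency model: a record append lands on every replica *at least* once, so replicas can disagree about duplicates, and readers must deduplicate using their own record-level identifiers. (Cross-replica reordering and padding, which the paper also allows, are not simulated here.)

```python
import random

def append_to_replicas(replicas, record_id, payload, rng):
    """At-least-once append: every replica gets the record, maybe twice."""
    for replica in replicas:
        replica.append((record_id, payload))          # appended at least once...
        if rng.random() < 0.5:
            replica.append((record_id, payload))      # ...and possibly more than once

def logical_records(replica):
    """Reader-side dedup by the client-supplied record id."""
    seen, out = set(), []
    for record_id, payload in replica:
        if record_id not in seen:
            seen.add(record_id)
            out.append(payload)
    return out

rng = random.Random(42)
replicas = [[], [], []]
for i, payload in enumerate(["a", "b", "c"]):
    append_to_replicas(replicas, i, payload, rng)

# Raw replica contents differ, but after dedup each yields the same records.
print([len(r) for r in replicas])
print([logical_records(r) for r in replicas])
```

This is the burden the paper shifts onto applications: the file system only promises the record is in there somewhere, at least once; uniqueness and ordering are the client's problem.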
So this isn't at all like the file systems we are used to on our desktop machines.
Note that HDFS is a poor fit for large numbers of small files, because:

1. Each file would typically occupy a 64 MB chunk (source).
2. Its architecture isn't good at managing a huge number of filenames (source: the same as in item 1). There is a single master maintaining all the filenames, which hopefully fit in its RAM.
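A back-of-the-envelope sketch of why that single master is the bottleneck. The ~150 bytes of master (namenode) memory per file/block entry used below is a commonly cited rule of thumb, not an exact figure; the point is only that metadata cost scales with the *number* of objects, not with the bytes stored.

```python
# Assumed rough cost per file or block entry in the master's RAM (rule of thumb).
BYTES_PER_OBJECT = 150
CHUNK = 64 * 1024 * 1024        # the 64 MB chunk size from the text

def master_ram(total_bytes, file_size):
    """Approximate master memory needed to store total_bytes as equal-size files."""
    n_files = total_bytes // file_size
    blocks_per_file = max(1, -(-file_size // CHUNK))   # ceiling division
    n_objects = n_files * (1 + blocks_per_file)        # one file entry + its blocks
    return n_objects * BYTES_PER_OBJECT

TB = 1024 ** 4
# Same 1 TB of data: 1 MB files vs 1 GB files.
print(master_ram(TB, 1024 ** 2))   # many small files: metadata balloons
print(master_ram(TB, 1024 ** 3))   # few large files: metadata stays tiny
```

Storing the same terabyte as 1 MB files costs the master two orders of magnitude more memory than storing it as 1 GB files, which is why "lots of small files" is the classic HDFS anti-pattern.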