MapReduce and downloading files from external sources
I have a project where the requirement is to download files, in a distributed manner, from external sources. We already have a large investment in Hadoop and are looking to leverage MapReduce -- but more as a distributed task runner than as ETL.
1) Has anyone done this before?
2) Should there be just a Mapper without a Reducer?
3) What's the best way to pass an abstract implementation of an FTP/HTTP connection to the Mapper? -- To be clear, I want a good way to unit test this without running an integration test, so I need a way to mock FTP/HTTP.
4) Is MapReduce the best method for this type of thing? -- are we abusing MapReduce?
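To illustrate what I'm getting at in (3), here's roughly what I had in mind (hypothetical names -- `Fetcher` and `Downloader` are just a sketch, not existing code): the Mapper would depend only on a small interface, so a unit test can substitute a mock instead of a live FTP/HTTP client.

```java
// Hypothetical abstraction: the download logic depends only on this
// interface, so a unit test can substitute a mock instead of a live
// FTP/HTTP client.
interface Fetcher {
    byte[] fetch(String url) throws java.io.IOException;
}

// Sketch of the download logic, isolated from Hadoop so it is testable.
class Downloader {
    private final Fetcher fetcher;

    Downloader(Fetcher fetcher) {
        this.fetcher = fetcher;
    }

    int download(String url) throws java.io.IOException {
        byte[] content = fetcher.fetch(url);
        // In the real Mapper this would be written to HDFS; here we
        // just report the size so the logic can be asserted on.
        return content.length;
    }
}

public class FetcherMockExample {
    public static void main(String[] args) throws Exception {
        // Mock fetcher: returns canned bytes without touching the network.
        Fetcher mock = url -> ("payload-for:" + url).getBytes();
        Downloader d = new Downloader(mock);
        System.out.println(d.download("http://example.com/a.txt")); // prints 36
    }
}
```

The real Mapper would then hold a `Downloader` and only ever call it with a real `Fetcher` in production.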
This 'sounds' similar to what Nutch does (although I'm not too familiar with Nutch beyond that statement).
Some points for observation:
- If you have several URLs hosted by the same server, you may actually benefit from partitioning by hostname and then doing the pulls in the Reducer (depending on the number of URLs you are pulling from each server)
- If the content is 'cacheable' and you will be pulling from the same URLs over and over, you 'may' benefit from putting a cache/proxy server between your Hadoop cluster and the internet (your company and ISP may/should already be doing this). However, if you are hitting unique URLs or the content is dynamic, this will actually hinder you, as you introduce a single bottleneck at the cache/proxy server
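To illustrate the first point: the partitioning amounts to keying each URL by its hostname in the Mapper, so the shuffle delivers all URLs for one server to the same Reducer. A plain-Java sketch of that grouping (not actual Hadoop code; the key-extraction logic is the part that matters):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HostPartitionSketch {
    // The key the Mapper would emit for each URL: its hostname.
    static String hostKey(String url) {
        return URI.create(url).getHost();
    }

    // Group URLs by hostname, mimicking what the shuffle would do:
    // deliver every URL for one server to the same Reducer.
    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String url : urls) {
            groups.computeIfAbsent(hostKey(url), k -> new ArrayList<>()).add(url);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/a", "http://example.com/b",
                "http://other.org/c");
        // example.com's two URLs end up together; other.org's is separate
        System.out.println(groupByHost(urls));
    }
}
```

The Reducer for each hostname can then throttle or reuse one connection per server, instead of many mappers hammering the same host independently.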
I think you should take a look at Storm. It's a scalable framework that's very useful for data collection from many different sources, which is really what you're trying to do. Processing can still be done using MapReduce, but for the actual collection you should use a framework like Storm.
I think your internet connection will easily become a bottleneck in this case but I'm sure it can be done.
- I haven't done this exact thing, but I have had to make a web service call from my Mapper to obtain some metadata from a third-party API for further processing. The third-party web service quickly became a bottleneck and slowed everything down.
- Yes, since there's nothing to reduce in this case (I'm assuming you just want to save the downloaded files somewhere).
- I'd save the FTP/HTTP URLs in HDFS and have your Mapper read them from there.
- I highly doubt MapReduce is the best method for this. As I said, your internet connection will easily become a bottleneck, so you won't be able to scale out your MR program very much. Once the files are downloaded (and saved in HDFS), processing them with MapReduce would be a different story. But yes, for the download itself I'd say you're abusing MR.
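If you do go the map-only route, the driver setup is mostly configuration -- a sketch under the assumption that a `DownloadMapper` class (hypothetical, not shown) does the actual fetch for each URL line. The key call is `setNumReduceTasks(0)`, which skips the reduce phase entirely; `NLineInputFormat` hands each mapper a batch of URL lines rather than a whole-file split:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DownloadJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "url-download");
        job.setJarByClass(DownloadJobDriver.class);

        // Each mapper gets a batch of 100 URL lines, not a whole file split.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100);

        job.setMapperClass(DownloadMapper.class); // hypothetical mapper doing the fetch
        job.setNumReduceTasks(0);                 // map-only: no reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        NLineInputFormat.addInputPath(job, new Path(args[0])); // file of URLs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Whether this is worth it depends on the earlier point: the mappers will scale, but your outbound bandwidth won't.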