Hi Dave,

This is the HBase solution to the poor scan performance issue:
https://issues.apache.org/jira/browse/HBASE-8369

I encountered the same issue before. To the best of my knowledge, this is not a MapReduce issue; it is an HBase issue. If you are planning to swap out MapReduce and replace it with Spark, I don't think you will gain much performance when scanning HBase, unless you are talking about caching the results from HBase in Spark and reusing them over and over.

HTH,

Jerry

On Wed, Apr 9, 2014 at 12:02 PM, David Quigley <dquigle...@gmail.com> wrote:

> Hi all,
>
> We are currently using hbase to store user data and periodically doing a
> full scan to aggregate data. The reason we use hbase is that we need a
> single user's data to be contiguous, so as user data comes in, we need the
> ability to update a random access store.
>
> The performance of a full hbase scan with MapReduce is frustratingly slow,
> despite implementing recommended optimizations. I see that it is possible
> to scan hbase with Spark, but am not familiar with how Spark interfaces
> with hbase. Would you expect the scan to perform similarly if used as a
> Spark input as a MapReduce input?
>
> Thanks,
> Dave
>