MapReduce is very coarse-grained. It might seem that more cores are always
better, but once the data sizes get well below the block size threshold, the
overhead of starting JVM processes and all the other background work becomes
a significant percentage of the overall runtime. So, you quickly reach the
point of diminishing returns.
Hi Dean,
Indeed, switching from RCFiles to SequenceFiles brought query duration down
35% (82 secs down to 53 secs)! I also added Snappy/Gzip block compression.
Things are getting even better, down to 30 secs (SequenceFile + Snappy).
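For the record, the relevant setup looks roughly like this (a sketch; the
logs_seq and logs table names are made up for illustration):

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

    -- Rewrite the existing table as a Snappy block-compressed SequenceFile table.
    CREATE TABLE logs_seq
    STORED AS SEQUENCEFILE
    AS SELECT * FROM logs;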
Yes, most requests have a WHERE clause with a time range.
RCFile won't help much (and apparently not at all in this case ;) unless you
have a lot of columns and you always query just a few of them. However, you
should get better results with SequenceFiles (a binary format), usually
with a compression scheme like BZip that supports block-level (as opposed
to record-level) compression.
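For comparison, only the storage clause changes when declaring an RCFile
table; the columnar layout is what lets Hive skip the columns a query never
reads. A minimal sketch, with made-up columns:

    CREATE TABLE logs_rc (
      ts          BIGINT,  -- request timestamp (epoch seconds)
      country     STRING,
      url         STRING,
      response_ms INT      -- server response time in milliseconds
    )
    STORED AS RCFILE;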
Hi there,
I've set up a virtual machine hosting Hive.
My use case is Web traffic analytics, hence most of my queries are:
- how many requests today?
- how many requests today, grouped by country?
- most requested URLs?
- average HTTP server response time (5-minute slots)?
In other words, let
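In HiveQL, those four queries might look roughly like the sketches below,
assuming a hypothetical access_logs table with columns ts (epoch seconds),
country, url, and response_ms:

    -- How many requests today?
    SELECT COUNT(*)
    FROM access_logs
    WHERE to_date(from_unixtime(ts)) = to_date(from_unixtime(unix_timestamp()));

    -- How many requests today, grouped by country?
    SELECT country, COUNT(*)
    FROM access_logs
    WHERE to_date(from_unixtime(ts)) = to_date(from_unixtime(unix_timestamp()))
    GROUP BY country;

    -- Most requested URLs?
    SELECT url, COUNT(*) AS hits
    FROM access_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;

    -- Average response time in 5-minute (300-second) slots?
    SELECT floor(ts / 300) * 300 AS slot_start, AVG(response_ms) AS avg_ms
    FROM access_logs
    GROUP BY floor(ts / 300) * 300;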