The story here is that we have a workflow based on Hive queries. It
takes several stages to reach the final data, and each stage has its own
Hive table. We are trying to rewrite the whole workflow in MapReduce;
ideally that removes all the intermediate steps and takes two rounds
of MapReduce to do the job.
I just tried the buffer-in-mapper approach, and the number of map output
records matches Hive's. Thank you
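For reference, the core of that buffer-in-mapper approach (in-mapper combining) can be sketched in plain Java, stripped of the Hadoop API so it stands alone. The field layout (value first, then the two grouping columns) and the \u0001 delimiter are assumptions based on the query later in the thread; in a real job, `map` would be `Mapper.map()` and `flush` would run in `Mapper.cleanup()`:

```java
import java.util.HashMap;
import java.util.Map;

// In-mapper combining: instead of emitting one record per input line,
// partial sums are buffered in a map keyed by the grouping columns and
// emitted once at the end of the task.
public class InMapperCombiner {
    private static final char DELIM = '\u0001'; // Hive's default field delimiter
    private final Map<String, Double> buffer = new HashMap<>();

    // Called once per input line, like Mapper.map(). Field positions
    // (value, key1, key2) are a hypothetical layout for illustration.
    public void map(String line) {
        String[] f = line.split(String.valueOf(DELIM));
        String groupKey = f[1] + DELIM + f[2];
        buffer.merge(groupKey, Double.parseDouble(f[0]), Double::sum);
    }

    // Called once at the end of the task, like Mapper.cleanup(); in a
    // real job the buffered aggregates would be written to the context here.
    public Map<String, Double> flush() {
        return buffer;
    }
}
```

The buffer has to fit in the mapper's heap; Hive's map-side aggregation handles this by flushing the hash table when it grows too large.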
On 08/01/2012 11:40 AM, Bertrand Dechoux wrote:
I am not sure about Hive, but if you look at Cascading, it uses a
pseudo-combiner instead of the standard (I mean Hadoop's) combiner.
I would guess Hive has a similar strategy.
The point is that when you use a compiler, the compiler does smart
things that you don't need to think about (like loop unrolling).
The result is that your code is still readable but optimized, and in
most cases the compiler will do better than you would.
Even your naive implementation of the Mapper (without the Reducer and
the configuration) is more complicated than the whole Hive query.
Like Chuck said, Hive is basically a MapReduce compiler. It is fun to
look at how it works, but it is often best to let the compiler work
for you instead of trying to beat it.
For simple cases, like a 'select', Hive (or any comparable
alternative) is helpful. And for complex cases, with multiple joins,
you will want something like Hive too, because with the vanilla
MapReduce API it can become quite hard to keep track of everything.
Basically, two reasons: faster to express and cheaper to maintain.
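To make the multiple-joins point concrete, a query along these lines (the table and column names are invented for illustration) compiles under Hive into a chain of MapReduce jobs that would be tedious and error-prone to hand-write:

```sql
SELECT u.country, SUM(o.amount)
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id
WHERE p.category = 'books'
GROUP BY u.country;
```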
One reason not to use Hive is if your approach is more programmatic,
for example machine learning, which requires a highly specific
workflow and user-defined functions.
It would be interesting to know your goal: are you trying to
benchmark Hive (and yourself)? Or do you have other reasons?
Bertrand
On Wed, Aug 1, 2012 at 5:13 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
As mentioned, if you avoid using new by re-using objects, and possibly
use buffer objects, you may be able to match or beat Hive's speed. But
in the general case Hive saves you time by allowing you not to worry
about low-level details like this.
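Edward's re-use advice can be sketched without the Hadoop API: the original mapper calls `new` twice per record, while the pattern below allocates one mutable holder up front and refills it. `MutableEntry` is a hypothetical stand-in for a custom Writable; in real Hadoop code the same pattern is safe because `context.write()` serializes the objects before the next call:

```java
// Object re-use: one mutable holder is allocated when the mapper starts
// and refilled for every record, instead of allocating per record.
public class ReusingParser {
    public static final class MutableEntry {   // stand-in for a custom Writable
        public int id;
        public String name;
        public double amount;
    }

    private final MutableEntry entry = new MutableEntry(); // allocated once

    // Refills and returns the shared instance; callers must copy any
    // fields they keep before the next call, exactly as with Writables.
    public MutableEntry parse(String[] fields) {
        entry.id = Integer.parseInt(fields[0]);
        entry.name = fields[1];
        entry.amount = Double.parseDouble(fields[2]);
        return entry;
    }
}
```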
On Wed, Aug 1, 2012 at 10:35 AM, Connell, Chuck <chuck.conn...@nuance.com> wrote:
> This is actually not surprising. Hive is essentially a MapReduce
> compiler. It is common for regular compilers (C, C#, Fortran) to emit
> faster assembly code than you would write yourself. Compilers know
> the tricks of their target language.
>
> Chuck Connell
> Nuance R&D Data Team
> Burlington, MA
>
>
> -----Original Message-----
> From: Yue Guan [mailto:pipeha...@gmail.com]
> Sent: Wednesday, August 01, 2012 10:29 AM
> To: user@hive.apache.org
> Subject: mapper is slower than hive's mapper
>
> Hi, there
>
> I'm writing MapReduce to replace some Hive queries, and I find that
> my mapper is slower than Hive's mapper. The Hive query is like:
>
> select sum(column1) from table group by column2, column3;
>
> My MapReduce program looks like this:
>
> public static class HiveTableMapper extends
>         Mapper<BytesWritable, Text, MyKey, DoubleWritable> {
>
>     public void map(BytesWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>         String[] sLine = StringUtils.split(value.toString(),
>                 StringUtils.ESCAPE_CHAR, HIVE_FIELD_DELIMITER_CHAR);
>         context.write(new MyKey(Integer.parseInt(sLine[0]), sLine[1]),
>                 new DoubleWritable(Double.parseDouble(sLine[2])));
>     }
>
> }
>
> I assume Hive is doing something similar. Is there any trick in Hive
> to speed this up? Thank you!
>
> Best,
>
--
Bertrand Dechoux