My bad. I wasn't sure, but at least I know now. Other solutions may use
other 'Serialization' strategies, like Thrift (which is the only other
customisation point of Hadoop).
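
As a side note, that customisation point is the 'io.serializations'
property, which lists the Serialization implementations the framework will
try for intermediate keys and values. Here is a minimal sketch of how it is
wired up; 'com.example.ThriftSerialization' is a hypothetical class used
only for illustration, it does not ship with Hadoop.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.serializer.JavaSerialization;
    import org.apache.hadoop.io.serializer.WritableSerialization;

    public class SerializationConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Register the serialization strategies the framework may use,
            // in order of preference.
            conf.setStrings("io.serializations",
                    WritableSerialization.class.getName(), // default Writable strategy
                    JavaSerialization.class.getName(),     // plain java.io.Serializable objects
                    "com.example.ThriftSerialization");    // hypothetical Thrift-based strategy
            System.out.println(conf.get("io.serializations"));
        }
    }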

Bertrand

On Wed, Aug 1, 2012 at 5:49 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Hive does not use combiners; it uses map-side aggregation. Hive does
> use writables: sometimes it uses the ones from Hadoop, sometimes it uses
> its own custom writables for things like timestamps.
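
For reference, here is a minimal sketch of what map-side ("in-mapper")
aggregation looks like with the plain MapReduce API, applied to a 'sum ...
group by' like the one in the question further down. It only illustrates the
general idea, not Hive's actual implementation (Hive also flushes its
in-memory hash table when it grows too large), and all class and field names
here are made up.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapperSumMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {

        // Partial sums per group key, accumulated in memory instead of
        // emitting one record per input line and relying on a combiner.
        private final Map<String, Double> partialSums = new HashMap<String, Double>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // Input lines are assumed to be: column1 <TAB> column2 <TAB> column3
            String[] fields = value.toString().split("\t");
            String groupKey = fields[1] + "\u0001" + fields[2];
            double v = Double.parseDouble(fields[0]);
            Double sum = partialSums.get(groupKey);
            partialSums.put(groupKey, sum == null ? v : sum + v);
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit one partial sum per distinct group key seen by this mapper,
            // so the reducer only has to add a handful of values per key.
            Text outKey = new Text();
            DoubleWritable outValue = new DoubleWritable();
            for (Map.Entry<String, Double> e : partialSums.entrySet()) {
                outKey.set(e.getKey());
                outValue.set(e.getValue());
                context.write(outKey, outValue);
            }
        }
    }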
>
> On Wed, Aug 1, 2012 at 11:40 AM, Bertrand Dechoux <decho...@gmail.com>
> wrote:
> > I am not sure about Hive, but if you look at Cascading, they use a pseudo
> > combiner instead of the standard (I mean Hadoop's) combiner.
> > I guess Hive has a similar strategy.
> >
> > The point is that when you use a compiler, the compiler does smart things
> > that you don't need to think about (like loop unwinding).
> > The result is that your code is still readable but optimized, and in most
> > cases the compiler will do better than you would.
> >
> > Even your naive implementation of the Mapper (without the Reducer and the
> > configuration) is more complicated than the whole Hive query.
> >
> > Like Chuck said, Hive is basically a MapReduce compiler. It is fun to look
> > at how it works. But it is often best to let the compiler work for you
> > instead of trying to beat it.
> >
> > For simple cases, like a 'select', Hive (or any other solution at the same
> > level) is helpful. And for complex cases, with multiple joins, you will
> > want something like Hive too, because with the vanilla MapReduce API it can
> > become quite hard to grasp everything. Basically, two reasons: faster to
> > express and cheaper to maintain.
> >
> > One reason not to use Hive is if your approach is more programmatic, for
> > example if you want to do machine learning, which will require a highly
> > specific workflow and user-defined functions.
> >
> > It would be interesting to know your goal: are you trying to benchmark
> > Hive (and yourself)? Or do you have other reasons?
> >
> > Bertrand
> >
> >
> > On Wed, Aug 1, 2012 at 5:13 PM, Edward Capriolo <edlinuxg...@gmail.com>
> > wrote:
> >>
> >> As mentioned, if you avoid using new by re-using objects, and possibly
> >> use buffer objects, you may be able to match or beat the speed. But in
> >> the general case Hive saves you time by allowing you not to worry
> >> about low-level details like this.
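
To make that concrete, here is a minimal sketch of the object re-use Edward
describes, shaped like the mapper in the quoted question further down. The
\u0001 field delimiter is an assumption (it is Hive's default), and the key
is emitted as Text here instead of the custom MyKey class, purely to keep the
sketch self-contained.

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ReusingMapper
            extends Mapper<BytesWritable, Text, Text, DoubleWritable> {

        // Allocated once per task and refilled for every record, instead of
        // calling 'new' for the key and value on every input line.
        private final Text outKey = new Text();
        private final DoubleWritable outValue = new DoubleWritable();

        @Override
        protected void map(BytesWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\u0001");
            outKey.set(fields[0] + "\u0001" + fields[1]);
            outValue.set(Double.parseDouble(fields[2]));
            context.write(outKey, outValue);
        }
    }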
> >>
> >> On Wed, Aug 1, 2012 at 10:35 AM, Connell, Chuck
> >> <chuck.conn...@nuance.com> wrote:
> >> > This is actually not surprising. Hive is essentially a MapReduce
> >> > compiler. It is common for regular compilers (C, C#, Fortran) to emit
> >> > faster assembler code than you would write yourself. Compilers know
> >> > the tricks of their target language.
> >> >
> >> > Chuck Connell
> >> > Nuance R&D Data Team
> >> > Burlington, MA
> >> >
> >> >
> >> > -----Original Message-----
> >> > From: Yue Guan [mailto:pipeha...@gmail.com]
> >> > Sent: Wednesday, August 01, 2012 10:29 AM
> >> > To: user@hive.apache.org
> >> > Subject: mapper is slower than hive' mapper
> >> >
> >> > Hi, there
> >> >
> >> > I'm writing MapReduce to replace some Hive queries, and I find that my
> >> > mapper is slower than Hive's mapper. The Hive query looks like this:
> >> >
> >> > select sum(column1) from table group by column2, column3;
> >> >
> >> > My MapReduce program looks like this:
> >> >
> >> >     public static class HiveTableMapper
> >> >             extends Mapper<BytesWritable, Text, MyKey, DoubleWritable> {
> >> >
> >> >         public void map(BytesWritable key, Text value, Context context)
> >> >                 throws IOException, InterruptedException {
> >> >             String[] sLine = StringUtils.split(value.toString(),
> >> >                     StringUtils.ESCAPE_CHAR, HIVE_FIELD_DELIMITER_CHAR);
> >> >             context.write(new MyKey(Integer.parseInt(sLine[0]), sLine[1]),
> >> >                     new DoubleWritable(Double.parseDouble(sLine[2])));
> >> >         }
> >> >     }
> >> >
> >> > I assume Hive is doing something similar. Is there any trick in Hive to
> >> > speed this thing up? Thank you!
> >> >
> >> > Best,
> >> >
> >
> >
> >
> >
> > --
> > Bertrand Dechoux
>



-- 
Bertrand Dechoux
