Hi,

The entire table of 34 million records is in a single ORC file, and it is
around 7 GB in size. The other table is a dimension table of less than
40 MB, once again in a single ORC file.

I do not remember setting the ORC file stripe size anywhere.

The problem I am facing is that the query triggers only a single
mapper, even though the cluster has three nodes. Unlike the other posts
here, I need more mappers.

The relevant properties from the job XML file are as follows:
<property><name>mapred.min.split.size.per.node</name><value>1</value></property>
and
<property><name>mapred.max.split.size</name><value>256000000</value></property>
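For reference, the kind of session-level settings that can influence the mapper count look like the sketch below. The specific values are illustrative assumptions, not a tested configuration; ORC generates splits per stripe, so a smaller max split size can still yield multiple mappers even from a single large file:

```sql
-- Sketch: force more mappers by shrinking splits and disabling split combining.
-- Values below are illustrative, not recommendations.
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;  -- disable CombineHiveInputFormat
SET mapred.max.split.size=67108864;          -- 64 MB max split; smaller splits => more mappers
SET mapred.min.split.size.per.node=1;        -- allow tiny per-node splits
SET mapred.min.split.size.per.rack=1;        -- allow tiny per-rack splits
```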

I am sure there is no issue with the Hadoop configuration, as some
other queries run with more than 24 mappers.

Please accept my sincere regards for your kind help and insights.


Thanks,
Gourav Sengupta



On Wed, Oct 9, 2013 at 6:22 PM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> What is your ORC file stripe size? How many ORC files are there in each of
> the tables? It could be possible that ORC compressed the file so much that
> the file size is less than the HDFS block size. Can you please report the
> file size of the two ORC files?
>
> Another possibility is that there are many small files. In that case, by
> default Hive uses CombineHiveInputFormat, which combines many small files
> into a single large split. Hence you will see fewer mappers. If you are
> expecting one mapper per HDFS file, then try disabling
> CombineHiveInputFormat with "set
> hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;". Another
> way to control the number of mappers is by adjusting the min and max split
> sizes.
>
> Thanks
> Prasanth Jayachandran
>
> On Oct 9, 2013, at 10:03 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>
> > What's the size of the table (in GB)?
> >
> > What max and min split sizes have you provided?
> >
> >
> > On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <
> gourav.had...@gmail.com>wrote:
> >
> >> Hi,
> >>
> >> I am trying to run a join using two tables stored in ORC file format.
> >>
> >> The first table has 34 million records and the second has around 300,000
> >> records.
> >>
> >> Setting "set hive.auto.convert.join=true" makes the entire query run
> via a
> >> single mapper.
> >> If I set "set hive.auto.convert.join=false", then there are two
> >> mappers: the first reads the second table, and then the entire large
> >> table goes through the second mapper.
> >>
> >> Is there something I am doing wrong? There are three nodes in
> >> the Hadoop cluster currently, and I was expecting at least 6 mappers
> >> to be used.
> >>
> >> Thanks and Regards,
> >> Gourav
> >>
> >
> >
> >
> > --
> > Nitin Pawar
>
>
