Hi,

The entire table of 34 million records is in a single ORC file, around 7 GB in size. The other ORC file is a dimension table of less than 40 MB, once again in a single ORC file.
I do not remember setting the ORC file stripe size anywhere. The problem I am facing is that the query triggers only a single mapper, even though the cluster has three nodes. Unlike other posts here, I need more mappers.

The relevant properties from the job XML file are:

<property><name>mapred.min.split.size.per.node</name><value>1</value></property>
<property><name>mapred.max.split.size</name><value>256000000</value></property>

I am sure there is no issue with the Hadoop configuration, as with some other queries I am getting more than 24 mappers.

Please accept my sincere regards for your kind help and insights.

Thanks,
Gourav Sengupta

On Wed, Oct 9, 2013 at 6:22 PM, Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:

> What is your ORC file stripe size? How many ORC files are there in each of
> the tables? It is possible that ORC compressed the file so much that the
> file size is less than the HDFS block size. Can you please report the file
> size of the two ORC files?
>
> Another possibility is that there are many small files. In that case, by
> default Hive uses CombineHiveInputFormat, which combines many small files
> into a single large split; hence you will see fewer mappers. If you are
> expecting one mapper per HDFS file, then try disabling
> CombineHiveInputFormat with
> "set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;".
> Another way to control the number of mappers is by adjusting the min and
> max split sizes.
>
> Thanks
> Prasanth Jayachandran
>
> On Oct 9, 2013, at 10:03 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>
> > What is the size of the table (in GB)?
> >
> > What max and min split sizes have you provided?
> >
> > On Wed, Oct 9, 2013 at 10:28 PM, Gourav Sengupta <gourav.had...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I am trying to run a join using two tables stored in ORC file format.
> >>
> >> The first table has 34 million records and the second has around 300,000
> >> records.
> >>
> >> Setting "set hive.auto.convert.join=true" makes the entire query run via
> >> a single mapper.
> >> If I set "set hive.auto.convert.join=false", then there are two mappers:
> >> the first one reads the second table, and then the entire large table
> >> goes through the second mapper.
> >>
> >> Is there something I am doing wrong? There are three nodes in the Hadoop
> >> cluster currently, and I was expecting that at least 6 mappers would be
> >> used.
> >>
> >> Thanks and Regards,
> >> Gourav
> >
> >
> > --
> > Nitin Pawar
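Pulling Prasanth's suggestions together, the settings could be tried at the session level before running the join. This is only a sketch: the property names come from the thread itself, but the 64 MB max split value is illustrative, not something anyone in the thread recommended.

```sql
-- Disable split combining so the single large ORC file is split per the
-- configured split sizes (suggestion from Prasanth's reply above)
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Lower the max split size so the 7 GB file yields more splits
-- (64 MB here is an illustrative value, not from the thread)
set mapred.max.split.size=67108864;
set mapred.min.split.size.per.node=1;
```

After these settings, re-running the join should show the mapper count in the job summary; comparing it against the number of ORC stripes would indicate whether stripe boundaries are the limiting factor.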
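As a back-of-envelope check (my own arithmetic, not from the thread): with CombineHiveInputFormat disabled and the `mapred.max.split.size=256000000` from the job XML, a splittable 7 GB file should produce roughly 30 splits, so a single mapper suggests the splits are being collapsed somewhere. A minimal sketch of that estimate, assuming one split per max-split-size chunk:

```python
import math

def expected_splits(file_size_bytes, max_split_bytes):
    """Rough upper bound on map tasks for one splittable file:
    one split per max_split_size-sized chunk of the file."""
    return math.ceil(file_size_bytes / max_split_bytes)

# The ~7 GB fact table from the thread, with mapred.max.split.size=256000000
fact_table_bytes = 7 * 1024**3   # ~7 GB
max_split_bytes = 256_000_000    # 256 MB, as set in the job XML

print(expected_splits(fact_table_bytes, max_split_bytes))  # -> 30
```

The real split count also depends on ORC stripe boundaries (splits cannot cut through a stripe), so the actual number may be lower if the stripes are large, but it should be far more than one.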