Hi Paul, I am having the same problem. Do you know any efficient way of merging the files?
-Mohit

On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmack...@adobe.com> wrote:

> How much time is it spending in the map and reduce phases, respectively? The
> large number of files could be creating a lot of mappers, which creates a lot
> of overhead. What happens if you merge the 2624 files into a smaller number,
> like 24 or 48? That should speed up the mapper phase significantly.
>
> From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
> Sent: Tuesday, December 06, 2011 6:01 AM
> To: user@hive.apache.org
> Subject: Hive query taking too much time
>
> Hi All,
>
> My setup is:
>
> hadoop-0.20.203.0
> hive-0.7.1
>
> I have a 5-node cluster: 4 data nodes and 1 namenode (which is also acting
> as the secondary namenode). On the namenode I have set up Hive with
> HiveDerbyServerMode to support multiple Hive server connections.
>
> I have loaded plain-text CSV files into HDFS using Hive 'LOAD DATA'
> statements. The total number of files is 2624 and their combined size is
> only 713 MB, which is very small from a Hadoop perspective, since Hadoop
> can handle TBs of data very easily.
>
> The problem is that when I run a simple count query (i.e. select count(*)
> from a_table), it takes too much time to execute.
>
> For instance, it takes almost 17 minutes to execute the said query when the
> table has 950,000 rows. I understand that this is too long for a query over
> such a small amount of data.
>
> This is only a dev environment; in the production environment the number of
> files and their combined size will move into millions and GBs respectively.
>
> On analyzing the logs on all the datanodes and the namenode/secondary
> namenode, I do not find any errors in them.
>
> I have also tried setting mapred.reduce.tasks to a fixed number, but the
> number of reducers always remains 1, while the number of maps is determined
> by Hive.
>
> Any suggestion as to what I am doing wrong, or how I can improve the
> performance of Hive queries? Any suggestion or pointer is highly
> appreciated.
>
> Keshav

--
Best Regards,
Mohit Gupta
Software Engineer at Vdopia Inc.
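
On the merging step Paul describes (2624 small files into something like 24 or 48), here is a minimal sketch of one way it could be done before the data is loaded. It is not taken from the thread. It assumes the CSV files are still on local disk, that they have no header rows, and that row order does not matter. The function name `merge_small_files` and the `merged_NNN.csv` output names are hypothetical.

```python
from pathlib import Path


def merge_small_files(src_dir, dst_dir, n_out=24, pattern="*.csv"):
    """Concatenate files matching `pattern` in src_dir into at most n_out files.

    Assumes headerless, line-oriented text files (an assumption, not stated
    in the thread). Returns the list of output paths that were written.
    """
    sources = sorted(Path(src_dir).glob(pattern))
    if not sources:
        return []

    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    # Never create more output files than there are inputs.
    n_out = max(1, min(n_out, len(sources)))

    outputs = []
    for i in range(n_out):
        out_path = dst / f"merged_{i:03d}.csv"
        with open(out_path, "wb") as out:
            # Round-robin assignment: input j goes to output j % n_out.
            for src in sources[i::n_out]:
                data = src.read_bytes()  # inputs are small, so read whole file
                out.write(data)
                # Keep the last row of one file from fusing with the first
                # row of the next when the trailing newline is missing.
                if data and not data.endswith(b"\n"):
                    out.write(b"\n")
        outputs.append(out_path)
    return outputs
```

Calling `merge_small_files("csv_in", "csv_merged", n_out=24)` (both directory names are placeholders) would leave 24 larger files that can then be loaded with the same 'LOAD DATA' statements. Round-robin assignment only balances the outputs if the inputs are of similar size; with very uneven inputs, assigning by cumulative size would be a better choice.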