Hey, if all the files have the same columns, then you can merge them easily with a shell script:
table=yourtable
for file in *.csv
do
  cat "$file" >> new_file.csv
done
hive -e "load data local inpath 'new_file.csv' into table $table"

This will merge all the files into a single file, which you can then load into the table with that one LOAD DATA query.

On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta <success.mohit.gu...@gmail.com> wrote:

> Hi Paul,
> I am having the same problem. Do you know any efficient way of merging
> the files?
>
> -Mohit
>
> On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmack...@adobe.com> wrote:
>
>> How much time is it spending in the map and reduce phases,
>> respectively? The large number of files could be creating a lot of
>> mappers, which creates a lot of overhead. What happens if you merge the
>> 2624 files into a smaller number, like 24 or 48? That should speed up
>> the map phase significantly.
>>
>> From: Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
>> Sent: Tuesday, December 06, 2011 6:01 AM
>> To: user@hive.apache.org
>> Subject: Hive query taking too much time
>>
>> Hi All,
>>
>> My setup is:
>> hadoop-0.20.203.0
>> hive-0.7.1
>>
>> I have a 5-node cluster in total: 4 datanodes and 1 namenode (which
>> also acts as the secondary namenode). On the namenode I have set up
>> Hive with HiveDerbyServerMode to support multiple Hive server
>> connections.
>>
>> I have inserted plain-text CSV files into HDFS using 'LOAD DATA' Hive
>> query statements. The total number of files is 2624 and their combined
>> size is only 713 MB, which is very small from a Hadoop perspective,
>> since Hadoop can handle TBs of data easily.
>>
>> The problem is that when I run a simple count query (i.e. select
>> count(*) from a_table), it takes too much time to execute.
>>
>> For instance, it takes almost 17 minutes to execute that query when the
>> table has only 950,000 rows. I understand that is far too long for such
>> a small amount of data.
>> This is only a dev environment; in the production environment the
>> number of files and their combined size will grow into the millions
>> and GBs respectively.
>>
>> On analyzing the logs on all the datanodes and the namenode/secondary
>> namenode, I do not find any errors in them.
>>
>> I have also tried setting mapred.reduce.tasks to a fixed number, but
>> the number of reducers always remains 1, while the number of maps is
>> determined by Hive.
>>
>> Any suggestions on what I am doing wrong, or on how I can improve the
>> performance of Hive queries? Any suggestion or pointer is highly
>> appreciated.
>>
>> Keshav
>>
>> _____________
>> The information contained in this message is proprietary and/or
>> confidential. If you are not the intended recipient, please: (i) delete
>> the message and all copies; (ii) do not disclose, distribute or use the
>> message in any manner; and (iii) notify the sender immediately. In
>> addition, please be aware that any message addressed to our domain is
>> subject to archiving and review by persons other than the intended
>> recipient. Thank you.
>
> --
> Best Regards,
>
> Mohit Gupta
> Software Engineer at Vdopia Inc.

--
With Regards,
Vikas Srivastava
DWH & Analytics Team
Mob: +91 9560885900
One97 | Let's get talking!
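P.S. In case it helps, here is a self-contained sketch of the merge loop from the top of the thread that you can try locally first. The directory, file names, and row data below are made up purely for illustration, and the hive step is left as a comment so the script runs without a cluster:

```shell
#!/bin/sh
# Scratch directory with a few sample CSV "part" files (hypothetical data).
dir=/tmp/csv_merge_demo
mkdir -p "$dir"
cd "$dir"
printf 'a,1\n' > part1.csv
printf 'b,2\n' > part2.csv
printf 'c,3\n' > part3.csv

# Concatenate every part into one file, as in the merge script above.
rm -f new_file.csv
for file in part*.csv
do
  cat "$file" >> new_file.csv
done

# One merged file means one LOAD DATA call, and far fewer mappers at
# query time:
#   hive -e "load data local inpath 'new_file.csv' into table yourtable"
wc -l < new_file.csv   # row count of the merged file (3 here)
```

If the small files are already sitting in HDFS, `hadoop fs -getmerge <hdfs-dir> new_file.csv` achieves the same merge in a single command.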