On Wed, Mar 30, 2011 at 3:31 PM, V.Senthil Kumar <vaisen2...@yahoo.com> wrote:
> Thanks for the suggestion. The query created just one result file.
>
> Also, before trying this query, I had found another way of making this
> work. I added the following properties to hive-site.xml and it worked as
> well. It created just one result file.
>
> <property>
>   <name>hive.merge.mapredfiles</name>
>   <value>true</value>
>   <description>Merge small files at the end of a map-reduce job</description>
> </property>
>
> <property>
>   <name>hive.input.format</name>
>   <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
>   <description>The default input format, if it is not specified, the system
>   assigns it. It is set to HiveInputFormat for hadoop versions 17, 18 and 19,
>   whereas it is set to CombineHiveInputFormat for hadoop 20. The user can
>   always overwrite it - if there is a bug in CombineHiveInputFormat, it can
>   always be manually set to HiveInputFormat.</description>
> </property>
>
> ----- Original Message ----
> From: Jov <zhao6...@gmail.com>
> To: user@hive.apache.org
> Sent: Tue, March 29, 2011 10:22:32 PM
> Subject: Re: INSERT OVERWRITE LOCAL DIRECTORY -- Why it creates multiple files
>
> Try adding a limit:
>
> INSERT OVERWRITE LOCAL DIRECTORY
> '/home/hdp-user/hiveadmin_dirs/outbox/apachetest'
> Select host, identity, user, time, request
> from raw_apachelog
> where ds = '2011-03-22-001500' limit 32;
>
> 2011/3/30 V.Senthil Kumar <vaisen2...@yahoo.com>:
>> Hello,
>>
>> I have a Hive query which does a simple select and writes the results to
>> the local file system.
>>
>> For example, a query like this:
>>
>> INSERT OVERWRITE LOCAL DIRECTORY
>> '/home/hdp-user/hiveadmin_dirs/outbox/apachetest'
>> Select host, identity, user, time, request
>> from raw_apachelog
>> where ds = '2011-03-22-001500';
>>
>> Now this creates two files under the apachetest folder. This table has
>> only 32 rows. Is there any way I can make Hive create only a single file?
>>
>> Appreciate your help :)
>>
>> Thanks,
>> Senthil
The number of output files is a result of the number of reducers used in the job. Adding a LIMIT appends a single-reducer phase to the end of the job, which is why the output collapses to one file. You should be able to accomplish the same thing with 'set mapred.reduce.tasks=1'.
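A minimal sketch of how that might look, reusing the table and path from the quoted query (whether the setting takes effect depends on the query plan actually including a reduce phase):

-- Force a single reducer so the reduce stage writes one output file.
set mapred.reduce.tasks=1;

INSERT OVERWRITE LOCAL DIRECTORY
'/home/hdp-user/hiveadmin_dirs/outbox/apachetest'
SELECT host, identity, user, time, request
FROM raw_apachelog
WHERE ds = '2011-03-22-001500';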