Re: Hive produces very small files despite hive.merge...=true settings

2010-11-23 Thread Ning Zhang
This should be expected. Compressed text files are not splittable, so CombineHiveInputFormat cannot read multiple files per mapper. CombineHiveInputFormat is used when hive.merge.maponly=true. If you set it to false, we'll use HiveInputFormat and that should be able to merge compressed tex

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-22 Thread Leo Alekseyev
I found another criterion that determines whether or not the merge job runs with compression turned on. It seems that if the target table is stored as an rcfile, merges work, but if a text file, merges will fail. For instance: -- merge will work here: create table alogs_dbg_sample3 (server_host

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-19 Thread Ning Zhang
It makes sense. CombineHiveInputFormat does not work with compressed text files (suffix *.gz) since they are not splittable. I think your default is hive.input.format=CombineHiveInputFormat. But I think by setting hive.merge.maponly it should work (meaning the merge should succeed). By setting hive.m
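
As a rough illustration of what "not splittable" means for a *.gz file, here is a minimal Python sketch. It is not Hive or Hadoop code: the helper name can_decompress_from and the sample rows are made up for this example. It only shows that a gzip stream can be decoded from its beginning but not from an arbitrary offset, which is what a reader would need in order to start work in the middle of a file.

```python
import gzip
import zlib


def can_decompress_from(data: bytes, offset: int) -> bool:
    """Return True if gzip decompression succeeds when started at `offset`.

    Hypothetical helper for illustration only.
    """
    try:
        gzip.decompress(data[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # A slice that does not begin with a valid gzip header, or whose
        # deflate data is incomplete or corrupt, cannot be decoded.
        return False


# Made-up sample rows standing in for a compressed text table file.
text = b"server_host\tbytes\n" * 1000
compressed = gzip.compress(text)

# Reading from the start works; reading from the middle does not.
print(can_decompress_from(compressed, 0))
print(can_decompress_from(compressed, len(compressed) // 2))
```

Uncompressed text does not have this limitation, because a reader can seek to any offset and resume at the next line boundary.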

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-19 Thread yongqiang he
I don't think this is the cause. The problem is more likely that your files cannot be merged; I mean the file size is bigger than the split size.
On Friday, November 19, 2010, Leo Alekseyev wrote:
> Folks, thanks for your help.  I've narrowed the problem down to
> compression.  When I set hive.ex

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-19 Thread Leo Alekseyev
Folks, thanks for your help. I've narrowed the problem down to compression. When I set hive.exec.compress.output=false, merges proceed as expected. When compression is on, the merge job doesn't seem to actually merge, it just spits out the input. On Fri, Nov 19, 2010 at 10:51 AM, yongqiang he

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-19 Thread yongqiang he
These are the parameters that control the behavior. (Try to set them to different values if it does not work in your environment.)
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size.per.node=10;
set mapred.min.split.size.per.rack=10

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-19 Thread Leo Alekseyev
I'm using Hadoop 0.20.2. Merge jobs (with static partitions) have worked for me in the past. Again, what's strange here is with the latest Hive build the merge stage appears to run, but it doesn't actually merge -- it's a quick map-only job that, near as I can tell, doesn't do anything. On Fri,

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-19 Thread Dave Brondsema
What version of Hadoop are you on?
On Thu, Nov 18, 2010 at 10:48 PM, Leo Alekseyev wrote:
> I thought I was running Hive with those changes merged in, but to make
> sure, I built the latest trunk version. The behavior changed somewhat
> (as in, it runs 2 stages instead of 1), but it still gener

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-18 Thread Leo Alekseyev
I thought I was running Hive with those changes merged in, but to make sure, I built the latest trunk version. The behavior changed somewhat (as in, it runs 2 stages instead of 1), but it still generates the same number of files (# of files generated is equal to the number of the original mappers,

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-18 Thread Ning Zhang
I see. If you are using dynamic partitions, HIVE-1307 and HIVE-1622 need to be there for merging to take place. HIVE-1307 was committed to trunk on 08/25 and HIVE-1622 was committed on 09/13. The simplest way is to update your Hive trunk and rerun the query. If it still doesn't work maybe you ca

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-18 Thread Ted Yu
Leo: You may find this helpful: http://indoos.wordpress.com/2010/06/24/hive-remote-debugging/
On Thu, Nov 18, 2010 at 2:57 PM, Leo Alekseyev wrote:
> Hi Ning,
> For the dataset I'm experimenting with, the total size of the output
> is 2mb, and the files are at most a few kb in size. My
> hive.i

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-18 Thread Leo Alekseyev
Hi Ning, For the dataset I'm experimenting with, the total size of the output is 2mb, and the files are at most a few kb in size. My hive.input.format was set to default HiveInputFormat; however, when I set it to CombineHiveInputFormat, it only made the first stage of the job use fewer mappers. T

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-18 Thread Ning Zhang
The settings look good. The parameter hive.merge.smallfiles.avgsize is used to determine at run time whether a merge should be triggered: if the average size of the files in the partition is SMALLER than the parameter and there is more than one file, the merge should be scheduled. Can you try to
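
The run-time check described above can be sketched in a few lines of Python. This is only a restatement of the rule as worded in this message, not Hive's actual implementation; the function name should_merge and the threshold value used in the example are assumptions chosen for illustration.

```python
def should_merge(file_sizes, avg_size_threshold):
    """Return True if a merge would be scheduled under the rule above.

    file_sizes: sizes in bytes of the output files in one partition.
    avg_size_threshold: stands in for the average-size parameter.
    Hypothetical helper; not taken from the Hive source.
    """
    # A single file (or no file) leaves nothing to merge.
    if len(file_sizes) <= 1:
        return False
    average = sum(file_sizes) / len(file_sizes)
    # Merge only when the files are, on average, smaller than the threshold.
    return average < avg_size_threshold


# Assumed threshold for illustration only.
THRESHOLD = 16_000_000

# About 3000 files of roughly 1 KB each, as in the original question.
print(should_merge([1024] * 3000, THRESHOLD))
```

Under this rule, many tiny files in a partition qualify for a merge, so if the merge stage still leaves them untouched the cause is likely elsewhere, for example in how the input format reads them.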

Hive produces very small files despite hive.merge...=true settings

2010-11-17 Thread Leo Alekseyev
I have jobs that sample (or generate) a small amount of data from a large table. At the end, I get e.g. about 3000 or more files of 1kb or so. This becomes a nuisance. How can I make Hive do another pass to merge the output? I have the following settings:
hive.merge.mapfiles=true
hive.merge.ma