I thought I was running Hive with those changes merged in, but to make sure, I built the latest trunk version. The behavior changed somewhat (it now runs 2 stages instead of 1), but it still generates the same number of files: the number of output files equals the number of original mappers, so I have no idea what the second stage is actually doing.
See below for the query and its explain output. Stage-1 always runs; Stage-3 runs when hive.merge.mapfiles=true is set, but the output still ends up as lots of small files. The query is fairly large, but in essence it's simply: insert overwrite table foo partition(bar) select [columns] from [table] tablesample(bucket 1 out of 10000 on rand()) where [conditions].

explain
insert overwrite table hbase_prefilter3_us_sample partition (ds)
select server_host, client_ip, time_stamp,
       concat(server_host, ':', regexp_extract(request_url, '/[^/]+/[^/]+/([^/]+)$', 1)),
       referrer, parse_url(referrer, 'HOST'), user_agent, cookie,
       geoip_int(client_ip, 'COUNTRY_CODE', './GeoIP.dat'), '', ds
from alogs_master TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand()) am_s
where am_s.ds = '2010-11-05'
  and am_s.request_url rlike '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$'
  and geoip_int(am_s.client_ip, 'COUNTRY_CODE', './GeoIP.dat') = 'US';

OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF alogs_master (TOK_TABLESAMPLE 1 10000 (TOK_FUNCTION rand)) am_s)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB hbase_prefilter3_us_sample (TOK_PARTSPEC (TOK_PARTVAL ds)))) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL server_host)) (TOK_SELEXPR (TOK_TABLE_OR_COL client_ip)) (TOK_SELEXPR (TOK_TABLE_OR_COL time_stamp)) (TOK_SELEXPR (TOK_FUNCTION concat (TOK_TABLE_OR_COL server_host) ':' (TOK_FUNCTION regexp_extract (TOK_TABLE_OR_COL request_url) '/[^/]+/[^/]+/([^/]+)$' 1))) (TOK_SELEXPR (TOK_TABLE_OR_COL referrer)) (TOK_SELEXPR (TOK_FUNCTION parse_url (TOK_TABLE_OR_COL referrer) 'HOST')) (TOK_SELEXPR (TOK_TABLE_OR_COL user_agent)) (TOK_SELEXPR (TOK_TABLE_OR_COL cookie)) (TOK_SELEXPR (TOK_FUNCTION geoip_int (TOK_TABLE_OR_COL client_ip) 'COUNTRY_CODE' './GeoIP.dat')) (TOK_SELEXPR '') (TOK_SELEXPR (TOK_TABLE_OR_COL ds))) (TOK_WHERE (and (and (= (. (TOK_TABLE_OR_COL am_s) ds) '2010-11-05') (rlike (. (TOK_TABLE_OR_COL am_s) request_url) '^/img[0-9]+/[0-9]+/[^.]+\.(png|jpg|gif|mp4|swf)$')) (= (TOK_FUNCTION geoip_int (. (TOK_TABLE_OR_COL am_s) client_ip) 'COUNTRY_CODE' './GeoIP.dat') 'US')))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-5 depends on stages: Stage-1 , consists of Stage-4, Stage-3
  Stage-4
  Stage-0 depends on stages: Stage-4, Stage-3
  Stage-2 depends on stages: Stage-0
  Stage-3

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        am_s
          TableScan
            alias: am_s
            Filter Operator
              predicate:
                  expr: (((hash(rand()) & 2147483647) % 10000) = 0)
                  type: boolean
              Filter Operator
                predicate:
                    expr: ((request_url rlike '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$') and (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US'))
                    type: boolean
                Filter Operator
                  predicate:
                      expr: (((ds = '2010-11-05') and (request_url rlike '^/img[0-9]+/[0-9]+/[^.]+.(png|jpg|gif|mp4|swf)$')) and (GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' ) = 'US'))
                      type: boolean
                  Select Operator
                    expressions:
                          expr: server_host
                          type: string
                          expr: client_ip
                          type: int
                          expr: time_stamp
                          type: int
                          expr: concat(server_host, ':', regexp_extract(request_url, '/[^/]+/[^/]+/([^/]+)$', 1))
                          type: string
                          expr: referrer
                          type: string
                          expr: parse_url(referrer, 'HOST')
                          type: string
                          expr: user_agent
                          type: string
                          expr: cookie
                          type: string
                          expr: GenericUDFGeoIP ( client_ip, 'COUNTRY_CODE', './GeoIP.dat' )
                          type: string
                          expr: ''
                          type: string
                          expr: ds
                          type: string
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10
                    File Output Operator
                      compressed: true
                      GlobalTableId: 1
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          name: hbase_prefilter3_us_sample

  Stage: Stage-5
    Conditional Operator

  Stage: Stage-4
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://namenode.imageshack.us:9000/tmp/hive-hadoop/hive_2010-11-18_17-58-36_843_6726655151866456030/-ext-10000

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            ds
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: hbase_prefilter3_us_sample

  Stage: Stage-2
    Stats-Aggr Operator

  Stage: Stage-3
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://namenode.imageshack.us:9000/tmp/hive-hadoop/hive_2010-11-18_17-58-36_843_6726655151866456030/-ext-10002
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: hbase_prefilter3_us_sample
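For reference, here is a recap of my session settings in one block. The first four values are the ones I posted at the start of this thread; the last two are the variations discussed below (I tried both the default HiveInputFormat and CombineHiveInputFormat, and both values of hive.mergejob.maponly), so treat this as a sketch of one configuration I tried rather than a known-good recipe. Note that by the avgsize criterion quoted below, the merge should clearly trigger in my case: ~2 MB of output spread over ~3000 files is well under 1 kb average, far below the 16 MB threshold.

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set hive.merge.size.smallfiles.avgsize=16000000;
-- variations tried; neither made the merge stage actually run:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.mergejob.maponly=false;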
On Thu, Nov 18, 2010 at 3:44 PM, Ning Zhang <nzh...@fb.com> wrote:
> I see. If you are using dynamic partitions, HIVE-1307 and HIVE-1622 need to
> be there for merging to take place. HIVE-1307 was committed to trunk on 08/25
> and HIVE-1622 was committed on 09/13. The simplest way is to update your Hive
> trunk and rerun the query. If it still doesn't work, maybe you can post your
> query and the result of 'explain <query>' and we can take a look.
>
> Ning
>
> On Nov 18, 2010, at 2:57 PM, Leo Alekseyev wrote:
>
>> Hi Ning,
>> For the dataset I'm experimenting with, the total size of the output
>> is 2 MB, and the files are at most a few kb in size. My
>> hive.input.format was set to the default HiveInputFormat; however, when I
>> set it to CombineHiveInputFormat, it only made the first stage of the
>> job use fewer mappers. The merge job was *still* filtered out at
>> runtime. I also tried set hive.mergejob.maponly=false; that didn't
>> have any effect.
>>
>> I am a bit at a loss as to what to do here. Is there a way to see what's
>> going on exactly, using e.g. debug log levels? Btw, I'm also using
>> dynamic partitions; could that somehow be interfering with the merge
>> job?
>>
>> I'm running a relatively fresh Hive from trunk (built maybe a month ago).
>>
>> --Leo
>>
>> On Thu, Nov 18, 2010 at 1:12 PM, Ning Zhang <nzh...@fb.com> wrote:
>>> The settings look good. The parameter hive.merge.size.smallfiles.avgsize
>>> is used to determine at run time whether a merge should be triggered: if the
>>> average size of the files in the partition is SMALLER than the parameter
>>> and there is more than 1 file, the merge should be scheduled. Can you
>>> check whether you have any big files as well in your resulting partition? If it
>>> is because of a very large file, you can set the parameter large enough.
>>>
>>> Another possibility is that your Hadoop installation does not support
>>> CombineHiveInputFormat, which is used for the new merge job. Someone
>>> previously reported that the merge was not successful because of this. If that's the
>>> case, you can turn off CombineHiveInputFormat and use the old
>>> HiveInputFormat (though slower) by setting hive.mergejob.maponly=false.
>>>
>>> Ning
>>>
>>> On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:
>>>
>>>> I have jobs that sample (or generate) a small amount of data from a
>>>> large table. At the end, I get e.g. about 3000 or more files of 1 kb
>>>> or so. This becomes a nuisance. How can I make Hive do another pass
>>>> to merge the output? I have the following settings:
>>>>
>>>> hive.merge.mapfiles=true
>>>> hive.merge.mapredfiles=true
>>>> hive.merge.size.per.task=256000000
>>>> hive.merge.size.smallfiles.avgsize=16000000
>>>>
>>>> After setting hive.merge* to true, Hive started indicating "Total
>>>> MapReduce jobs = 2". However, after generating the
>>>> lots-of-small-files table, Hive says:
>>>> Ended Job = job_201011021934_1344
>>>> Ended Job = 781771542, job is filtered out (removed at runtime).
>>>>
>>>> Is there a way to force the merge, or am I missing something?
>>>> --Leo
>>>
>>>
>
>