That metadata is required to identify the file as a valid ORC file. Each file just has a few bytes for the ORC header and the postscript information (compression, version, buffer size, etc., as specified via table properties). It's not completely safe to delete those empty bucket files, as there are some known issues related to joins: bucket map joins, for example, expect to find one file per bucket.
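If you want to double-check one of those files, the ORC file dump utility prints only the metadata (assuming your Hive build ships it; the path below is just an example):

    hive --orcfiledump /user/hive/warehouse/dwh.db/<table>/date_report_start_part=2015-07-30/000000_0

For an empty bucket file it should report zero rows and no stripe data.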

On Aug 18, 2015, at 8:46 AM, Juraj jiv <fatcap....@gmail.com> wrote:

Hi, yes - I saw somewhere in the SQL scripts that bucketing was enabled ad hoc via set commands: "hive.enforce.bucketing" + "hive.optimize.bucketmapjoin". So that metadata is required? I can't just delete those 43-byte files?
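Something like this, if I remember right:

    SET hive.enforce.bucketing=true;
    SET hive.optimize.bucketmapjoin=true;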

JV

On Tue, Aug 18, 2015 at 5:35 PM, Prasanth Jayachandran <j.prasant...@gmail.com> wrote:
Are you using bucketing? If so, those are empty ORC files that contain no data, only metadata.
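This happens because Hive writes one output file per bucket, whether or not the bucket receives any rows. A minimal sketch (table, column, and staging-table names here are made up):

    SET hive.enforce.bucketing=true;

    CREATE TABLE dwh.events (id BIGINT, payload STRING)
    PARTITIONED BY (date_report_start_part STRING)
    CLUSTERED BY (id) INTO 32 BUCKETS
    STORED AS ORC;

    -- Writes exactly 32 files into the partition, one per bucket;
    -- buckets that get no rows become tiny metadata-only ORC files.
    INSERT OVERWRITE TABLE dwh.events
    PARTITION (date_report_start_part='2015-07-30')
    SELECT id, payload FROM dwh.events_staging;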


_____________________________
From: Juraj jiv <fatcap....@gmail.com>
Sent: Tuesday, August 18, 2015 8:28 AM
Subject: Hive 12 - CDH 5.0.1 - many small files when using ORC table
To: <user@hive.apache.org>



Hello all,

I have a question about the ORC table format. We use it for our datastore tables, but during maintenance I noticed there are many small files inside the tables which I presume don't contain any data. They are only 43 bytes in size, and they make up around 70% of all files inside the table folder.

For example (grepping for files that are exactly 43 bytes vs. all others):

hadoop@hadoopnn:~$ hdfs dfs -du -h /user/hive/warehouse/dwh.db/<table>/date_report_start_part=2015-07-30 | grep "^43 " | wc -l
7448
hadoop@hadoopnn:~$ hdfs dfs -du -h /user/hive/warehouse/dwh.db/<table>/date_report_start_part=2015-07-30 | grep -v "^43 " | wc -l
4712

Why is that? Why are there so many 43-byte files?

The ASCII content of the files, which I guess is just the ORC header, is:
0@▒▒▒"
      ▒▒ORC
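(To see the raw bytes, something like this works; the file name is just an example:

    hdfs dfs -cat /user/hive/warehouse/dwh.db/<table>/date_report_start_part=2015-07-30/000000_0 | hexdump -C

The "ORC" magic string should be visible, with nothing but header/footer metadata around it.)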

hive version:
0.12.0+cdh5.0.1+315     1.cdh5.0.1.p0.31     CDH 5

Thanks
JV



