That metadata is required to identify the file as a valid ORC file. Such a file just has a few bytes for the ORC header plus the postscript information (compression, version, buffer size, etc., as specified via table properties). It's not completely safe to delete those empty bucket files, as there are some known issues related to joins (bucket map joins, for example, rely on the bucket files being present).
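If you want to verify that one of those 43-byte files is such a metadata-only stub, you can dump it with the ORC file dump utility (exposed as hive --orcfiledump in newer Hive releases; older ones may not ship the CLI wrapper, and the bucket file name below, 000000_0, is just an example path):

hive --orcfiledump /user/hive/warehouse/dwh.db/<table>/date_report_start_part=2015-07-30/000000_0

For an empty bucket file the dump should report 0 rows alongside the postscript fields (compression kind, compression block size, file version).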
On Aug 18, 2015, at 8:46 AM, Juraj jiv <fatcap....@gmail.com> wrote:

Hi,
yes, I saw bucketing enabled ad hoc in some SQL scripts via set commands: "hive.enforce.bucketing" + "hive.optimize.bucketmapjoin". So that metadata information is required? I can't just delete those 43 B files?

JV

On Tue, Aug 18, 2015 at 5:35 PM, Prasanth Jayachandran <j.prasant...@gmail.com> wrote:

Are you using bucketing? If so, those are empty ORC files without any data, containing only metadata.

_____________________________
From: Juraj jiv <fatcap....@gmail.com>
Sent: Tuesday, August 18, 2015 8:28 AM
Subject: Hive 12 - CDH 5.0.1 - many small files when using ORC table
To: <user@hive.apache.org>

Hello all,
I have a question about the ORC table format. We use it for our datastore tables, but during maintenance I noticed there are many small files inside the tables which I presume don't contain any data. They are only 43 bytes in size and they account for around 70% of all files inside the table folder. For example (grepping for files 43 bytes in size, then for everything else):

hadoop@hadoopnn:~$ hdfs dfs -du -h /user/hive/warehouse/dwh.db/<table>/date_report_start_part=2015-07-30 | grep "^43 " | wc -l
7448
hadoop@hadoopnn:~$ hdfs dfs -du -h /user/hive/warehouse/dwh.db/<table>/date_report_start_part=2015-07-30 | grep -v "^43 " | wc -l
4712

Why is that? Why are there so many 43-byte files? The ASCII content of one of these files, which I guess is just the ORC header:

0@▒▒▒" ▒▒ORC

Hive version: 0.12.0+cdh5.0.1+315 1.cdh5.0.1.p0.31, CDH 5

Thanks
JV
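A minimal HiveQL sketch of how such stub files arise with enforced bucketing (the table, columns, and bucket count here are hypothetical, for illustration only):

set hive.enforce.bucketing=true;

-- Every insert into a partition of a table clustered INTO 32 BUCKETS
-- writes exactly 32 files, one per bucket.
CREATE TABLE events_example (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (date_report_start_part STRING)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC;

-- If the rows hash into only some of the 32 buckets, the remaining
-- bucket files are still written -- with no rows, just the ORC
-- header/postscript, i.e. the ~43-byte files seen above.
INSERT OVERWRITE TABLE events_example
PARTITION (date_report_start_part = '2015-07-30')
SELECT id, payload FROM staging_example;  -- staging_example is hypothetical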