The file has trailing data. If you want to recover the data, you can use:

% strings -3 -t d ~/Downloads/bucket_00000 | grep ORC
which will print the offsets where ORC occurs within the file:

0 ORC
4559 ORC

That means that there is one intermediate footer within the file. If you
slice the file at the right point (ORC offset + 4), you can get the data
back:

% dd bs=1 count=4563 < ~/Downloads/bucket_00000 > recover.orc

and

% orc-metadata recover.orc
{
  "name": "recover.orc",
  "type": "struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<data_type:string,source_file_name:string,telco_id:int,begin_connection_time:bigint,duration:int,call_type_id:int,supplement_service_id:int,in_abonent_type:int,out_abonent_type:int,switch_id:string,inbound_bunch:bigint,outbound_bunch:bigint,term_cause:int,phone_card_number:string,in_info_directory_number:string,in_info_internal_number:string,dialed_digits:string,out_info_directory_number:string,out_info_internal_number:string,forwarding_identifier:string,border_switch_id:string>>",
  "rows": 115,
  "stripe count": 1,
  "format": "0.12",
  "writer version": "HIVE-8732",
  "compression": "zlib",
  "compression block": 16384,
  "file length": 4563,
  "content": 3454,
  "stripe stats": 339,
  "footer": 744,
  "postscript": 25,
  "row index stride": 10000,
  "user metadata": {
    "hive.acid.key.index": "71698156,0,114;",
    "hive.acid.stats": "115,0,0"
  },
  "stripes": [
    {
      "stripe": 0,
      "rows": 115,
      "offset": 3,
      "length": 3451,
      "index": 825,
      "data": 2353,
      "footer": 273
    }
  ]
}

.. Owen

On Fri, Aug 5, 2016 at 2:47 AM, Igor Kuzmenko <f1she...@gmail.com> wrote:

> Unfortunately, I can't provide more information; I got this file from our
> tester, and he has already dropped the table.
>
> On Thu, Aug 4, 2016 at 9:16 PM, Prasanth Jayachandran <
> pjayachand...@hortonworks.com> wrote:
>
>> Hi
>>
>> In case of streaming, when a transaction is open the orc file is not
>> closed and hence may not be flushed completely. Did the transaction
>> commit successfully? Or was there any exception thrown during
>> writes/commit?
>>
>> Thanks
>> Prasanth
>>
>> On Aug 3, 2016, at 6:09 AM, Igor Kuzmenko <f1she...@gmail.com> wrote:
>>
>> Hello, I've got a malformed ORC file in my Hive table. The file was
>> created by the Hive Streaming API, and I have no idea under what
>> circumstances it became corrupted.
>>
>> File on google drive: link
>> <https://drive.google.com/file/d/0ByB92PAoAkrKeFFZRUN4WWVQY1U/view?usp=sharing>
>>
>> Exception message when trying to perform a select from the table:
>>
>> ERROR : Vertex failed, vertexName=Map 1,
>> vertexId=vertex_1468498236400_1106_6_00,
>> diagnostics=[Task failed, taskId=task_1468498236400_1106_6_00_000000,
>> diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running
>> task:java.lang.RuntimeException: java.lang.RuntimeException:
>> java.io.IOException: org.apache.hadoop.hive.ql.io.FileFormatException:
>> Malformed ORC file
>> hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_00000.
>> Invalid postscript length 0
>> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
>> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
>> at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
>> at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
>> at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>> at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
>> at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
>> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.RuntimeException: java.io.IOException:
>> org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file
>> hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_00000.
>> Invalid postscript length 0
>> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
>> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:142)
>> at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
>> at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
>> at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:326)
>> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
>> ... 14 more
>> Caused by: java.io.IOException:
>> org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file
>> hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_00000.
>> Invalid postscript length 0
>> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
>> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
>> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
>> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
>> ... 19 more
>> Caused by: org.apache.hadoop.hive.ql.io.FileFormatException: Malformed
>> ORC file
>> hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_00000.
>> Invalid postscript length 0
>> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.ensureOrcFooter(ReaderImpl.java:236)
>> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:376)
>> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:317)
>> at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:238)
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1259)
>> at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1151)
>> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:249)
>> ... 20 more
>>
>> Has anyone encountered such a situation?