Amit, I am not sure about your datanode issue, but it is definitely not related to ORC writing the 500 rows of kv1.txt.

Also, keeping the stripe size at 4MB is on the lower side. The default ORC stripe size of 256MB was chosen for better data read efficiency. Having many small stripes will also slightly increase the ORC file size, because stripe-level metadata is stored in the file footer.
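If you still want to experiment with the stripe size and row index stride, both can be set per table at creation time through ORC table properties. A rough, illustrative sketch (the table name is made up, and the values shown are just the 256MB/10k defaults spelled out):

-- hypothetical table, shown only to illustrate the ORC knobs
CREATE TABLE pokes_orc_tuned (foo INT, bar STRING)
STORED AS ORC
TBLPROPERTIES (
  "orc.stripe.size"="268435456",
  "orc.row.index.stride"="10000",
  "orc.compress"="ZLIB"
);

Smaller values trade read efficiency for more, smaller stripes, which is usually only worthwhile for experiments.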
Query time depends on the type of query. Since ORC is a columnar file format, queries that read most or all of the columns will not see significant gains; reading a small set of columns while filtering rows with where conditions reads only small chunks of data and hence improves query time. For a more significant improvement you might want to enable vectorized query execution, which works nicely with ORC. Also make sure the input format is set to HiveInputFormat (the default is CombineHiveInputFormat) to take advantage of ORC's split elimination feature: ORC stores stripe-level statistics that are used to eliminate input splits that don't satisfy the specified where condition. You can set the input format using the hive config hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat.
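Roughly, the session settings would look like this (hive.optimize.index.filter is an extra, optional setting I am adding here for ORC row-group level predicate pushdown, not something required for split elimination; the query just reuses your pokes_orc table as an illustration):

-- settings for the current session
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET hive.vectorized.execution.enabled=true;
SET hive.optimize.index.filter=true;

-- a query reading few columns with a selective filter is where ORC helps most
SELECT foo, count(*) FROM pokes_orc WHERE foo > 100 GROUP BY foo;

How much the split and row-group elimination helps also depends on whether the values of the filter column are clustered within the file; on randomly ordered data the stripe statistics may not prune much.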
Thanks
Prasanth Jayachandran

On Apr 5, 2014, at 12:48 AM, Amit Tewari <amittew...@gmail.com> wrote:

> Thanks for the reply. I did solve the protobuf issue by upgrading to 2.5, but then Hive 0.12 also started showing the same issue as 0.13 and 0.14.
>
> I was working through the CLI.
>
> Turns out the issue was due to the space (not) available to the data node. Let me elaborate for others on the list.
>
> I had about 2GB available on the partition where the data node directory was configured (the name node and data node space was on the same directory tree but different directories, of course). I inserted kv1.txt (a few KBs) into table #1 (stored as textfile) and then tried to "insert into table#2 select * from table#1". Table #2 was stored as ORC. It was difficult for me to guess that the converted ORC data would be too big to fit in 2GB, especially when the data node logs did not have any error. Nor was there any reserve configured for HDFS. I still don't know why it needs so much space; however, I could reproduce the error simply by pushing a 300MB file to HDFS with "hdfs dfs -put", thus realizing that it's a space issue. Migrated the datanode to a bigger partition and everything is fine now.
>
> On a separate note, I am not seeing any significant query time improvement by pushing data into ORC. About 25%, yes, but nowhere close to the multiples I was hoping for. I changed the striping to 4MB. Tried creating an index every 10k rows. Inserted 6 million rows and ran many different types of queries. Any ideas, people, on what I might be missing?
>
> Amit
>
> Sent from my mobile device, please excuse the typos
>
> On Apr 4, 2014, at 8:21 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:
>
>> Amit,
>>
>> Are you executing your select for conversion to ORC via beeline or the hive cli? From looking at your logs, it appears that you do not have permission in HDFS to write the resultant ORC data. Check permissions in HDFS to ensure that your user has write permission to the hive warehouse.
>>
>> I forwarded you a previous thread regarding Hive 12 protobuf issues.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>> On Apr 4, 2014 8:14 PM, "Amit Tewari" <amittew...@gmail.com> wrote:
>> I checked out and built Hive 0.13. Tried with the same results, i.e.
>>
>> eRpcServer.addBlock(NameNodeRpcServer.java:555)
>> at File /tmp/hive-hduser/hive_2014-04-04_20-34-43_550_7470522328893486504-1/_task_tmp.-ext-10002/_tmp.000000_3 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
>>
>> I also tried it with the release version of Hive 0.12 and that gave me a different error, related to protobuf incompatibility (pasted below).
>>
>> So at this point I can't run even the basic use case with ORC storage.
>>
>> Any pointers would be very helpful.
>>
>> Amit
>>
>> Error: java.lang.RuntimeException: Hive Runtime Error while closing operators
>> at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:240)
>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:415)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
>> Caused by: java.lang.UnsupportedOperationException: This is supposed to be overridden by subclasses.
>> at com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
>> at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.getSerializedSize(OrcProto.java:3046)
>> at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
>> at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
>> at org.apache.hadoop.hive.ql.io.orc.OrcProto$RowIndexEntry.getSerializedSize(OrcProto.java:4129)
>> at com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag(CodedOutputStream.java:749)
>> at com.google.protobuf.CodedOutputStream.computeMessageSize(CodedOutputStream.java:530)
>> at org.apache.hadoop.hive.ql.io.orc.OrcProto$RowIndex.getSerializedSize(OrcProto.java:4641)
>> at com.google.protobuf.AbstractMessageLite.writeTo(AbstractMessageLite.java:75)
>> at org.apache.hadoop.hive.ql.io.orc.WriterImpl$TreeWriter.writeStripe(WriterImpl.java:548)
>> at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.writeStripe(WriterImpl.java:1328)
>> at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:1699)
>> at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:1868)
>> at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:95)
>> at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:181)
>> at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:866)
>> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:596)
>> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
>> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
>> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
>> at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
>>
>> Amit
>>
>> On 4/4/14 2:28 PM, Amit Tewari wrote:
>>> Hi All,
>>>
>>> I am just trying to do some simple tests to see the speedup in Hive queries with Hive 0.14 (trunk version this morning). Just tried to use the sample test case to start with. First wanted to see how much I can speed up using the ORC format.
>>>
>>> However, for some reason I can't insert data into the table with ORC format.
>>> It fails with the exception "File <filename> could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation".
>>>
>>> I can, however, insert data into a text table without any issue.
>>>
>>> I have included the steps below.
>>>
>>> Any pointers would be appreciated.
>>>
>>> Amit
>>>
>>> I have a single-node setup with minimal settings. The jps output is as follows:
>>> $ jps
>>> 9823 NameNode
>>> 12172 JobHistoryServer
>>> 9903 DataNode
>>> 14895 Jps
>>> 11796 ResourceManager
>>> 12034 NodeManager
>>> Running Hadoop 2.2 with YARN.
>>>
>>> Step 1
>>> CREATE TABLE pokes (foo INT, bar STRING);
>>>
>>> Step 2
>>> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
>>>
>>> Step 3
>>> CREATE TABLE pokes_1 (foo INT, bar STRING);
>>>
>>> Step 4
>>> Insert into table pokes_1 select * from pokes;
>>>
>>> Step 5
>>> CREATE TABLE pokes_orc (foo INT, bar STRING) stored as orc;
>>>
>>> Step 6
>>> insert into pokes_orc select * from pokes; <__FAILED__ with the exception below>
>>>
>>> eRpcServer.addBlock(NameNodeRpcServer.java:555)
>>> at File /tmp/hive-hduser/hive_2014-04-04_20-34-43_550_7470522328893486504-1/_task_tmp.-ext-10002/_tmp.000000_3 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
>>> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1384)
>>> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2477)
>>> at org.apache.hadoop.hdfs.server.namenode.NameNodorg.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>>> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59582)
>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
>>> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
>>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)
>>>
>>> at org.apache.hadoop.hive.ql.exec.FileSinkOperator$FSPaths.closeWriters(FileSinkOperator.java:168)
>>> at org.apache.hadoop.hive.ql.exec.FileSinkOperator.closeOp(FileSinkOperator.java:843)
>>> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:577)
>>> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:588)
>>> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:588)
>>> at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:588)
>>> at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:227)
>>> ... 8 more
>>>
>>> Step 7
>>> Insert overwrite table pokes_1 select * from pokes; <Success>