[ https://issues.apache.org/jira/browse/HIVE-28450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sercan Tekin updated HIVE-28450:
--------------------------------
    Description: 
We are experiencing an issue with a partitioned table in Hive. When querying the table via the Hive CLI, data retrieval works as expected without any errors. However, when querying the same table through Spark, we encounter the following error in the HMS logs:

{code:java}
2024-01-30 23:03:59,052 main DEBUG org.apache.logging.log4j.core.util.SystemClock does not support precise timestamps.
Exception in thread "pool-7-thread-4" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at org.apache.thrift.transport.TSaslTransport.write(TSaslTransport.java:473)
        at org.apache.thrift.transport.TSaslServerTransport.write(TSaslServerTransport.java:42)
        at org.apache.thrift.protocol.TBinaryProtocol.writeString(TBinaryProtocol.java:227)
        at org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.write(FieldSchema.java:517)
        at org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.write(FieldSchema.java:456)
        at org.apache.hadoop.hive.metastore.api.FieldSchema.write(FieldSchema.java:394)
        at org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.write(StorageDescriptor.java:1423)
        at org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.write(StorageDescriptor.java:1250)
        at org.apache.hadoop.hive.metastore.api.StorageDescriptor.write(StorageDescriptor.java:1116)
        at org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.write(Partition.java:1033)
        at org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.write(Partition.java:890)
        at org.apache.hadoop.hive.metastore.api.Partition.write(Partition.java:786)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.write(ThriftHiveMetastore.java)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.write(ThriftHiveMetastore.java)
        at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result.write(ThriftHiveMetastore.java)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:58)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:603)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:600)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
        at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:600)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:313)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Exception in thread "pool-7-thread-6" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Exception in thread "pool-7-thread-9" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
{code}

This error appears to be related to the JVM's limit on array sizes: most VMs cannot allocate arrays all the way up to {{Integer.MAX_VALUE}} elements because some VMs reserve header words in arrays, so a larger allocation request fails with {{OutOfMemoryError: Requested array size exceeds VM limit}}. In the trace above, the serialized {{get_partitions}} response grows a {{ByteArrayOutputStream}} past that limit. For reference, the JDK guards its own buffers against this, e.g. in {{InputStream}}:
https://github.com/openjdk/jdk/blob/0e0dfca21f64ecfcb3e5ed7cdc2a173834faa509/src/java.base/share/classes/java/io/InputStream.java#L307-L313
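
A minimal sketch of that JDK pattern (names below are illustrative, not the actual JDK fields): buffer growth is clamped to a soft maximum slightly below {{Integer.MAX_VALUE}}, and only requests that genuinely exceed the VM limit fail.

{code:java}
// Illustrative sketch of the JDK-style soft cap; names are not real JDK/Hive API.
// Some VMs reserve header words in arrays, so the largest safely allocatable
// array is slightly below Integer.MAX_VALUE.
public final class SoftArrayLimit {
    private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    /** Grows a buffer capacity by ~1.5x, clamping at the soft maximum. */
    static int newCapacity(int oldCapacity, int minCapacity) {
        int preferred = oldCapacity + (oldCapacity >> 1); // grow ~1.5x
        if (preferred < minCapacity) {          // also handles int overflow
            preferred = minCapacity;
        }
        if (preferred <= MAX_ARRAY_SIZE) {
            return preferred;
        }
        if (minCapacity > MAX_ARRAY_SIZE) {
            // Nothing to clamp: the caller genuinely needs more than the VM allows.
            throw new OutOfMemoryError("Required array size too large");
        }
        return MAX_ARRAY_SIZE; // clamp instead of requesting an impossible allocation
    }

    private SoftArrayLimit() {}
}
{code}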

Spark has already implemented a similar limit on its side; it would be good to implement the same on the Hive side:
https://github.com/apache/spark/blob/e5a5921968c84601ce005a7785bdd08c41a2d862/common/utils/src/main/scala/org/apache/spark/unsafe/array/ByteArrayUtils.java
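
A hypothetical sketch of what such a shared limit could look like on the Hive side (class and method names are placeholders, not an existing Hive API): one constant that serialization buffers consult before growing, so an oversized metastore response fails with a descriptive error rather than killing a worker thread with the OOM above.

{code:java}
// Hypothetical Hive-side helper; names are placeholders, not an existing API.
public final class JvmArrayLimits {
    /** Slightly below Integer.MAX_VALUE; some VMs reserve array header words. */
    public static final int MAX_JVM_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

    /** Validates a requested buffer length before allocating it. */
    public static int checkedLength(long requested) {
        if (requested < 0 || requested > MAX_JVM_ARRAY_LENGTH) {
            throw new IllegalStateException("Requested buffer of " + requested
                + " bytes exceeds the maximum safe JVM array length "
                + MAX_JVM_ARRAY_LENGTH);
        }
        return (int) requested;
    }

    private JvmArrayLimits() {}
}
{code}

A buffer that wraps Thrift output could then call {{checkedLength(...)}} before each growth step, turning the hard VM error into a recoverable, well-described exception.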

Workaround:
As a temporary workaround, I have been able to mitigate the issue by setting the {{hive.metastore.batch.retrieve.table.partition.max}} configuration to a lower value.
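
For example, in hive-site.xml on the metastore (the value below is illustrative; tune it to your partition counts):

{code:xml}
<!-- Illustrative value; lower it until the serialized response stays well
     below the JVM array limit. -->
<property>
  <name>hive.metastore.batch.retrieve.table.partition.max</name>
  <value>300</value>
</property>
{code}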


> Follow the array size of JVM in Hive transferable objects
> ---------------------------------------------------------
>
>                 Key: HIVE-28450
>                 URL: https://issues.apache.org/jira/browse/HIVE-28450
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sercan Tekin
>            Priority: Major
>


