[ https://issues.apache.org/jira/browse/HIVE-28450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sercan Tekin updated HIVE-28450:
--------------------------------
    Affects Version/s: 4.0.0
                       3.1.3

> Follow the array size of JVM in Hive transferable objects
> ---------------------------------------------------------
>
>                 Key: HIVE-28450
>                 URL: https://issues.apache.org/jira/browse/HIVE-28450
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 3.1.3, 4.0.0
>            Reporter: Sercan Tekin
>            Priority: Major
>
> We are experiencing an issue with a partitioned table in Hive. When querying 
> the table via the Hive CLI, the data retrieval works as expected without any 
> errors. However, when attempting to query the same table through Spark, we 
> encounter the following error in the HMS logs:
> {code:java}
> 2024-01-30 23:03:59,052 main DEBUG org.apache.logging.log4j.core.util.SystemClock does not support precise timestamps.
> Exception in thread "pool-7-thread-4" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>       at java.util.Arrays.copyOf(Arrays.java:3236)
>       at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>       at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>       at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>       at org.apache.thrift.transport.TSaslTransport.write(TSaslTransport.java:473)
>       at org.apache.thrift.transport.TSaslServerTransport.write(TSaslServerTransport.java:42)
>       at org.apache.thrift.protocol.TBinaryProtocol.writeString(TBinaryProtocol.java:227)
>       at org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.write(FieldSchema.java:517)
>       at org.apache.hadoop.hive.metastore.api.FieldSchema$FieldSchemaStandardScheme.write(FieldSchema.java:456)
>       at org.apache.hadoop.hive.metastore.api.FieldSchema.write(FieldSchema.java:394)
>       at org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.write(StorageDescriptor.java:1423)
>       at org.apache.hadoop.hive.metastore.api.StorageDescriptor$StorageDescriptorStandardScheme.write(StorageDescriptor.java:1250)
>       at org.apache.hadoop.hive.metastore.api.StorageDescriptor.write(StorageDescriptor.java:1116)
>       at org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.write(Partition.java:1033)
>       at org.apache.hadoop.hive.metastore.api.Partition$PartitionStandardScheme.write(Partition.java:890)
>       at org.apache.hadoop.hive.metastore.api.Partition.write(Partition.java:786)
>       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.write(ThriftHiveMetastore.java)
>       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result$get_partitions_resultStandardScheme.write(ThriftHiveMetastore.java)
>       at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result.write(ThriftHiveMetastore.java)
>       at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:58)
>       at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
>       at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:603)
>       at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor$1.run(HadoopThriftAuthBridge.java:600)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
>       at org.apache.hadoop.hive.thrift.HadoopThriftAuthBridge$Server$TUGIAssumingProcessor.process(HadoopThriftAuthBridge.java:600)
>       at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:313)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:750)
> Exception in thread "pool-7-thread-6" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> Exception in thread "pool-7-thread-9" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> {code}
> This error appears to be caused by the JVM's limit on the maximum array size: allocation requests close to Integer.MAX_VALUE elements fail with "Requested array size exceeds VM limit". For this reason the JDK itself conservatively caps its internal buffers slightly below that limit; for reference, see this implementation in the JDK source code:
> https://github.com/openjdk/jdk/blob/0e0dfca21f64ecfcb3e5ed7cdc2a173834faa509/src/java.base/share/classes/java/io/InputStream.java#L307-L313
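> For illustration, the JDK pattern looks roughly like this (a paraphrased sketch of the linked code, not a verbatim copy):
> {code:java}
> // Some VMs reserve header words in an array, so allocation requests very
> // close to Integer.MAX_VALUE can fail with:
> //   java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> // The JDK therefore caps its own buffers slightly below that limit:
> private static final int MAX_BUFFER_SIZE = Integer.MAX_VALUE - 8;
> {code}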
> Spark has implemented a similar limit on its side; it would be good to implement the same thing on the Hive side:
> https://github.com/apache/spark/blob/e5a5921968c84601ce005a7785bdd08c41a2d862/common/utils/src/main/scala/org/apache/spark/unsafe/array/ByteArrayUtils.java
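> A Hive-side equivalent could be a small utility along these lines (a minimal sketch only; the class and method names are hypothetical, not existing Hive or Spark APIs, and the -8 headroom is an assumption borrowed from the JDK's own buffer classes):
> {code:java}
> public final class TransferableArrayUtils {
>   // Cap mirroring the JVM's practical array-size limit.
>   public static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;
>
>   private TransferableArrayUtils() {}
>
>   // Doubles the current capacity but never exceeds MAX_ARRAY_LENGTH, failing
>   // fast with a descriptive error instead of a VM-level OutOfMemoryError.
>   public static int growCapacity(int current, int minRequired) {
>     if (minRequired > MAX_ARRAY_LENGTH) {
>       throw new OutOfMemoryError("Required array length " + minRequired
>           + " exceeds the maximum supported length " + MAX_ARRAY_LENGTH);
>     }
>     long doubled = 2L * current;
>     return (int) Math.min(Math.max(doubled, (long) minRequired), MAX_ARRAY_LENGTH);
>   }
> }
> {code}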
> Workaround:
> As a temporary workaround, I have been able to mitigate the issue by setting the hive.metastore.batch.retrieve.table.partition.max configuration to a lower value, presumably because fewer partition objects are then serialized into a single Thrift response.
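> For example, in hive-site.xml (the value 500 is only illustrative; tune it to the size of your partition metadata):
> {code:xml}
> <!-- Retrieve table partitions from the metastore in smaller batches, so a
>      single Thrift response stays well below the JVM array-size limit. -->
> <property>
>   <name>hive.metastore.batch.retrieve.table.partition.max</name>
>   <value>500</value>
> </property>
> {code}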



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
