[ 
https://issues.apache.org/jira/browse/HIVE-26893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yukun Zhang updated HIVE-26893:
-------------------------------
    Description: 
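There are several HMS APIs that return a list of partitions, e.g. 
get_partitions_ps(), get_partitions_by_names(), add_partitions_req() with 
needResult=true, etc. Each partition instance carries its own list of 
FieldSchemas as the partition schema: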
{code:java}
org.apache.hadoop.hive.metastore.api.Partition
-> org.apache.hadoop.hive.metastore.api.StorageDescriptor
   ->  cols: list<org.apache.hadoop.hive.metastore.api.FieldSchema> {code}
For wide tables (e.g. with 2k columns) this can occupy a large amount of 
memory. See the heap histogram in IMPALA-11812 for an example.

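Some engines, like Impala, don't actually use/respect the partition-level 
schema, so transmitting it wastes network/serde resources. It would be nice 
if these APIs provided an optional boolean flag for ignoring partition schemas, 
so that HMS clients (e.g. Impala) don't need to clear them later (to save memory).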

  was:
There are several HMS APIs that return a list of partitions, e.g. 
get_partitions_ps(), get_partitions_by_names(), add_partitions_req() with 
needResult=true, etc. Each partition instance will have a unique list of 
FieldSchemas as the partition schema:
{code:java}
org.apache.hadoop.hive.metastore.api.Partition
-> org.apache.hadoop.hive.metastore.api.StorageDescriptor
   ->  cols: list<org.apache.hadoop.hive.metastore.api.FieldSchema> {code}
This could occupy a large memory footprint for wide tables (e.g. with 2k cols). 
See the heap histogram in IMPALA-11812 as an example.

Some engines like Impala don't actually use/respect the partition-level 
schema. It's a waste of network/serde resources to transmit them. It'd be nice 
if these APIs provided an optional boolean flag for ignoring partition schemas, 
so HMS clients (e.g. Impala) don't need to clear them later (to save mem).


> Extend batch partition APIs to ignore partition schemas
> -------------------------------------------------------
>
>                 Key: HIVE-26893
>                 URL: https://issues.apache.org/jira/browse/HIVE-26893
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: Quanlong Huang
>            Assignee: Sai Hemanth Gantasala
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
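> There are several HMS APIs that return a list of partitions, e.g. 
> get_partitions_ps(), get_partitions_by_names(), add_partitions_req() with 
> needResult=true, etc. Each partition instance carries its own list of 
> FieldSchemas as the partition schema: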
> {code:java}
> org.apache.hadoop.hive.metastore.api.Partition
> -> org.apache.hadoop.hive.metastore.api.StorageDescriptor
>    ->  cols: list<org.apache.hadoop.hive.metastore.api.FieldSchema> {code}
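> For wide tables (e.g. with 2k columns) this can occupy a large amount of 
> memory. See the heap histogram in IMPALA-11812 for an example.
> Some engines, like Impala, don't actually use/respect the partition-level 
> schema, so transmitting it wastes network/serde resources. It would be nice 
> if these APIs provided an optional boolean flag for ignoring partition schemas, 
> so that HMS clients (e.g. Impala) don't need to clear them later (to save memory).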
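
To make the cost described in this issue concrete, here is a minimal sketch of the client-side workaround it wants to make unnecessary. It does not use the real metastore classes: FieldSchema, StorageDescriptor and Partition below are hypothetical, stripped-down stand-ins for the Thrift-generated types, and fetchPartitions(), countFieldSchemas() and clearPartitionSchemas() are illustrative names, not HMS APIs. The sketch assumes only what the description states, namely that every returned partition holds its own column list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the Thrift-generated metastore classes; only the
// fields relevant to this issue are modeled.
public class PartitionSchemaSketch {
    static class FieldSchema {
        final String name;
        final String type;
        FieldSchema(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    static class StorageDescriptor {
        List<FieldSchema> cols; // per-partition copy of the schema
        StorageDescriptor(List<FieldSchema> cols) {
            this.cols = cols;
        }
    }

    static class Partition {
        final String name;
        final StorageDescriptor sd;
        Partition(String name, StorageDescriptor sd) {
            this.name = name;
            this.sd = sd;
        }
    }

    // Simulates a batch partition response: every partition gets its own,
    // freshly allocated column list, as the description says happens today.
    static List<Partition> fetchPartitions(int numPartitions, int numCols) {
        List<Partition> parts = new ArrayList<>(numPartitions);
        for (int p = 0; p < numPartitions; p++) {
            List<FieldSchema> cols = new ArrayList<>(numCols);
            for (int c = 0; c < numCols; c++) {
                cols.add(new FieldSchema("col" + c, "string"));
            }
            parts.add(new Partition("part=" + p, new StorageDescriptor(cols)));
        }
        return parts;
    }

    // Counts the FieldSchema objects reachable from the partition list.
    static long countFieldSchemas(List<Partition> parts) {
        long total = 0;
        for (Partition part : parts) {
            if (part.sd.cols != null) {
                total += part.sd.cols.size();
            }
        }
        return total;
    }

    // The client-side workaround: drop the partition-level schemas after the
    // response has already been transmitted and deserialized.
    static void clearPartitionSchemas(List<Partition> parts) {
        for (Partition part : parts) {
            part.sd.cols = null;
        }
    }

    public static void main(String[] args) {
        List<Partition> parts = fetchPartitions(100, 2000);
        System.out.println("FieldSchemas held before clearing: " + countFieldSchemas(parts));
        clearPartitionSchemas(parts);
        System.out.println("FieldSchemas held after clearing: " + countFieldSchemas(parts));
    }
}
```

Clearing on the client frees the memory only after the schemas have been serialized, sent and deserialized, which is why the issue asks for a server-side flag instead.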



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
