[ 
https://issues.apache.org/jira/browse/HIVE-29116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikram Ahuja updated HIVE-29116:
--------------------------------
    Issue Type: Improvement  (was: Bug)

> Create a DDL for setting hive default partition name at the table level
> -----------------------------------------------------------------------
>
>                 Key: HIVE-29116
>                 URL: https://issues.apache.org/jira/browse/HIVE-29116
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Vikram Ahuja
>            Assignee: Vikram Ahuja
>            Priority: Major
>              Labels: pull-request-available
>
> The Hive property {{hive.exec.default.partition.name}} is currently a 
> *session-level configuration* that determines the directory name used when 
> partition column values are {{NULL}} or empty strings. While useful for 
> controlling default behavior, this setting introduces *serious 
> inconsistencies and operational challenges* in multi-user or shared 
> environments.
>  
> h3. *Problems Caused by Session-Scoped Default Partition Names*
> h4. 1. *Inconsistent Partition Layouts Across Sessions*
> Different users or jobs may configure different values (e.g., 
> {{__HIVE_DEFAULT_PARTITION__}}, {{NA}}, {{UNKNOWN}}), resulting in 
> *multiple folders for NULL partitions* under the same table. This leads to:
>  * Fragmentation of data
>  * Unreliable query results
>  * Duplicate rows or missed data
> h4. 2. *Interoperability Failures with External Engines*
> Engines like *Apache Spark*, *Trino*, and *Presto* are unaware of the 
> session-scoped Hive config, which results in:
>  * Missing or partially loaded data when querying Hive tables
>  * Incorrect partition pruning or data skipping
>  * Silent logical errors
> h4. 3. *Partition Management & Repair Failures*
> Commands like {{MSCK REPAIR TABLE}}, tools that list partitions, and 
> automatic partition management may treat differently named default partitions 
> as distinct, making repair, cleanup, and compaction logic brittle.
> h4. 4. *Difficulties During Migration to Iceberg*
> Modern table formats like *Iceberg* assume consistent and valid partition 
> paths. When migrating, multiple default partition folders complicate the 
> process and increase the risk of data loss or inconsistency.
> h4. 5. *Storage Bloat & Retention Policy Issues*
> Data with NULL partitions can accumulate across multiple folders and may be 
> missed by retention or cleanup tools. This causes:
>  * Inefficient storage
>  * Missed deletes
>  * Garbage accumulation
> h4. 6. *Risk of Human Error and Debugging Overhead*
> Since this is a session-level config, developers and analysts may forget to 
> set it consistently — especially during ad-hoc queries or notebook 
> exploration. This leads to:
>  * Hard-to-reproduce bugs
>  * Test environment differences
>  * Broken CI/CD data tests
>  
>  
> h3. *Proposed Improvements*
> We propose the following:
>  # Create a DDL for setting the Hive default partition name at the table level.
>  # Make {{hive.exec.default.partition.name}} immutable at runtime, so that it 
> can only be set at the cluster level.
> Together, these two changes ensure a single, consistent default partition 
> folder per table. Because the value is stored at the table level, other 
> engines can also use it.
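
To make the difference concrete, here is a minimal Python sketch of the two behaviors described in the issue. It is only a simulation under stated assumptions, not Hive code: the helper names and the table property key `default.partition.name` are hypothetical (the issue does not specify the DDL syntax or where the value would be stored), and real Hive path escaping is not modeled.

```python
from typing import Optional

# Hive's stock value for hive.exec.default.partition.name
HIVE_DEFAULT = "__HIVE_DEFAULT_PARTITION__"


def partition_dir(column: str, value: Optional[str], default_name: str) -> str:
    """Build a partition directory name, using default_name for NULL or empty values."""
    if value is None or value == "":
        value = default_name
    return f"{column}={value}"


def resolve_default_name(table_props: dict, cluster_default: str = HIVE_DEFAULT) -> str:
    """Hypothetical lookup: a table-level property wins, else the cluster-wide default.

    The key "default.partition.name" is an assumption made for this sketch.
    """
    return table_props.get("default.partition.name", cluster_default)


# Session-scoped behavior: three sessions with different settings write
# NULL partition values into the same table.
session_settings = [HIVE_DEFAULT, "NA", "UNKNOWN"]
session_dirs = {partition_dir("country", None, name) for name in session_settings}

# Table-scoped behavior (as proposed): every writer resolves the name from
# the table itself, so the session setting no longer matters.
table_props = {"default.partition.name": "UNKNOWN"}
table_dirs = {
    partition_dir("country", None, resolve_default_name(table_props))
    for _ in session_settings
}

print(sorted(session_dirs))  # one directory per distinct session setting
print(sorted(table_dirs))    # a single directory for the table
```

In the session-scoped case the set of directories grows with the number of distinct settings in use, while in the table-scoped case all writers agree on one folder, which is also something an external engine could read from table metadata.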



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
