[ 
https://issues.apache.org/jira/browse/HIVE-26035?focusedWorklogId=836612&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-836612
 ]

ASF GitHub Bot logged work on HIVE-26035:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 03/Jan/23 13:12
            Start Date: 03/Jan/23 13:12
    Worklog Time Spent: 10m 
      Work Description: VenuReddy2103 opened a new pull request, #3905:
URL: https://github.com/apache/hive/pull/3905

   ### What changes were proposed in this pull request?
   Proposing to use direct SQL for add partitions.
   1. Added a datanucleus property `datanucleus.query.sql.allowAll` to allow 
insert operation from JDO SQL(default value is true).
   2. Added a property `metastore.direct.sql.max.parameters.in.insert` to 
control the maximum rows in multiple row insert query to database.
   3. Implemented utility to get identity value from the DataNucleus value 
generator.
   4. Implemented multiple row insert in batches for all the tables involved in 
add partitions.
   5. Implemented fallback to JDO based in case of any failure in direct SQL.
   
   ### Why are the changes needed?
   When data is loaded/inserted to the dynamic partition table, many partitions 
are created based on the cardinality of the partition column(s). These 
partitions are updated to the hive metastore backing datastore using 
`pm.makePersistentAll()` datanucleus JDO API. Partition model class depends 
on(or holds reference to) several other model classes/structs such as Storage 
descriptor, Serdes, Serde params, Partition params, Partition key values etc. 
When a model class instance is persisted using JDO API, datanucleus internally 
does cascade persistence of all the other model instances it depends on 
recursively. During this process, it fires a lot of SQL queries to the backing 
datastore for the tables of the model classes/structs respectively. Many of the 
queries among them are redundant and takes longer time when partitions are 
more. [HIVE-26035](https://issues.apache.org/jira/browse/HIVE-26035)
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Tested the scenario manually.  Ran all the unittest cases. Below is the 
performance measurement in the local system using MySQL database with 
metastore-benchmarks tool available at  
[https://github.com/akolb1/hclient/tree/master/metastore-benchmarks](https://github.com/akolb1/hclient/tree/master/metastore-benchmarks).
   <img width="616" alt="image" 
src="https://user-images.githubusercontent.com/35334869/210361898-6a18ed2b-5347-4001-bbea-fa0f2dddf5fc.png";>
   
   




Issue Time Tracking
-------------------

            Worklog Id:     (was: 836612)
    Remaining Estimate: 0h
            Time Spent: 10m

> Explore moving to directsql for ObjectStore::addPartitions
> ----------------------------------------------------------
>
>                 Key: HIVE-26035
>                 URL: https://issues.apache.org/jira/browse/HIVE-26035
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently {{addPartitions}} uses datanuclues and is super slow for large 
> number of partitions. It will be good to move to direct sql. Lots of repeated 
> SQLs can be avoided as well (e.g SDS, SERDE, TABLE_PARAMS)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to