[ 
https://issues.apache.org/jira/browse/HIVE-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058109#comment-13058109
 ] 

jirapos...@reviews.apache.org commented on HIVE-2246:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/985/
-----------------------------------------------------------

Review request for hive.


Summary
-------

We can re-organize the JDO models to reduce space usage to keep the metastore 
scalable for the future. Currently, partitions are the fastest growing objects 
in the metastore, and the metastore keeps a separate copy of the columns list 
for each partition. We can normalize the metastore db by decoupling Columns 
from Storage Descriptors and not storing duplicate lists of the columns for 
each partition.

An idea is to create an additional level of indirection with a "Column 
Descriptor" that has a list of columns. A table has a reference to its latest 
Column Descriptor (note: a table may have more than one Column Descriptor in 
the case of schema evolution). Partitions and Indexes can reference the same 
Column Descriptors as their parent table.
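
As a rough illustration of the indirection described above, here is a minimal 
Java sketch. All class, field, and method names in it (ColumnDescriptorSketch, 
Table, Partition, evolveSchema, and so on) are hypothetical and are not taken 
from the patch or from the actual JDO model classes. It also assumes that 
partitions created before a schema change keep pointing at the older Column 
Descriptor, which is one possible reading of the note on schema evolution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only; these are not the classes from the patch.
final class ColumnDescriptorSketch {

    /** A single column: name and type. */
    static final class Column {
        final String name;
        final String type;
        Column(String name, String type) { this.name = name; this.type = type; }
    }

    /** The shared object: one immutable list of columns. */
    static final class ColumnDescriptor {
        final List<Column> columns;
        ColumnDescriptor(List<Column> columns) {
            this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
        }
    }

    /** A partition holds a reference to a descriptor, not its own column list. */
    static final class Partition {
        final String name;
        final ColumnDescriptor cd;
        Partition(String name, ColumnDescriptor cd) { this.name = name; this.cd = cd; }
    }

    /** A table points at its latest descriptor and hands it to new partitions. */
    static final class Table {
        ColumnDescriptor latest;
        final List<Partition> partitions = new ArrayList<>();

        Table(ColumnDescriptor initial) { this.latest = initial; }

        Partition addPartition(String name) {
            Partition p = new Partition(name, latest); // shares, does not copy
            partitions.add(p);
            return p;
        }

        /** Assumption: schema evolution creates a new descriptor; old partitions keep the old one. */
        void evolveSchema(List<Column> newColumns) {
            latest = new ColumnDescriptor(newColumns);
        }
    }

    public static void main(String[] args) {
        Table t = new Table(new ColumnDescriptor(
                Arrays.asList(new Column("id", "int"), new Column("name", "string"))));
        Partition p1 = t.addPartition("ds=2011-06-29");
        Partition p2 = t.addPartition("ds=2011-06-30");
        System.out.println("p1 and p2 share a descriptor: " + (p1.cd == p2.cd));

        t.evolveSchema(Arrays.asList(new Column("id", "int"),
                new Column("name", "string"), new Column("age", "int")));
        Partition p3 = t.addPartition("ds=2011-07-01");
        System.out.println("p3 uses the table's latest descriptor: " + (p3.cd == t.latest));
        System.out.println("p1 still uses the old descriptor: " + (p1.cd != t.latest));
    }
}
```

The point of the sketch is only that adding a partition adds one reference, 
not another copy of the column list.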

Currently, the COLUMNS table in the metastore has roughly (number of partitions 
+ number of tables) * (average number of columns per table) rows. We can reduce 
this to (number of tables) * (average number of columns per table) rows, while 
incurring a small cost proportional to the number of tables to store the Column 
Descriptors.
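
To make the estimate concrete, the small Java sketch below plugs sample 
numbers into the two formulas above. The class and method names are 
hypothetical, and the figures (1,000 tables, 1,000,000 partitions, 20 columns 
per table) are made up for illustration; they are not measurements from any 
real metastore.

```java
// Hypothetical helper, not part of the patch: evaluates the two row-count estimates.
final class ColumnRowEstimate {

    /** Current layout: one copy of the column list per table and per partition. */
    static long rowsBefore(long tables, long partitions, long avgColumnsPerTable) {
        return (partitions + tables) * avgColumnsPerTable;
    }

    /** Proposed layout: one copy of the column list per table. */
    static long rowsAfter(long tables, long avgColumnsPerTable) {
        return tables * avgColumnsPerTable;
    }

    public static void main(String[] args) {
        // Made-up sample sizes, for illustration only.
        long tables = 1_000L;
        long partitions = 1_000_000L;
        long avgColumnsPerTable = 20L;

        System.out.println("rows before: " + rowsBefore(tables, partitions, avgColumnsPerTable));
        System.out.println("rows after:  " + rowsAfter(tables, avgColumnsPerTable));
    }
}
```

With these sample numbers the estimate drops from 20,020,000 rows to 20,000 
rows, plus the Column Descriptor rows, whose count grows with the number of 
tables rather than the number of partitions.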


This addresses bug HIVE-2246.
    https://issues.apache.org/jira/browse/HIVE-2246


Diffs
-----

  trunk/metastore/if/hive_metastore.thrift 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MColumnDescriptor.java PRE-CREATION
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MDatabase.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MFieldSchema.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MIndex.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartition.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MStorageDescriptor.java 1140399
  trunk/metastore/src/model/org/apache/hadoop/hive/metastore/model/MTable.java 1140399
  trunk/metastore/src/model/package.jdo 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/TableBasedIndexHandler.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/bitmap/BitmapIndexHandler.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/MetaDataFormatUtils.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 1140399
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 1140399

Diff: https://reviews.apache.org/r/985/diff


Testing
-------

I haven't run any unit tests yet; only qualitative testing so far.


Thanks,

Sohan



> Dedupe tables' column schemas from partitions in the metastore db
> -----------------------------------------------------------------
>
>                 Key: HIVE-2246
>                 URL: https://issues.apache.org/jira/browse/HIVE-2246
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>            Reporter: Sohan Jain
>            Assignee: Sohan Jain
>
> We can re-organize the JDO models to reduce space usage to keep the metastore 
> scalable for the future.  Currently, partitions are the fastest growing 
> objects in the metastore, and the metastore keeps a separate copy of the 
> columns list for each partition.  We can normalize the metastore db by 
> decoupling Columns from Storage Descriptors and not storing duplicate lists 
> of the columns for each partition. 
> An idea is to create an additional level of indirection with a "Column 
> Descriptor" that has a list of columns.  A table has a reference to its 
> latest Column Descriptor (note: a table may have more than one Column 
> Descriptor in the case of schema evolution).  Partitions and Indexes can 
> reference the same Column Descriptors as their parent table.
> Currently, the COLUMNS table in the metastore has roughly (number of 
> partitions + number of tables) * (average number of columns per table) rows.  
> We can reduce this to (number of tables) * (average number of columns per 
> table) rows, while incurring a small cost proportional to the number of 
> tables to store the Column Descriptors.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
