[ https://issues.apache.org/jira/browse/HIVE-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161749#comment-13161749 ]
Ashutosh Chauhan commented on HIVE-2246:
----------------------------------------

Thanks Namit for pointing this out. HCatalog looks into the column information of partitions, so it will have an issue. Do you have a fix for it? Or, if you can point out which part of the script has the bug, we can take a look.

> Dedupe tables' column schemas from partitions in the metastore db
> -----------------------------------------------------------------
>
>                 Key: HIVE-2246
>                 URL: https://issues.apache.org/jira/browse/HIVE-2246
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>            Reporter: Sohan Jain
>            Assignee: Sohan Jain
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2246.2.patch, HIVE-2246.3.patch, HIVE-2246.4.patch,
> HIVE-2246.8.patch
>
>
> Note: this patch proposes a schema change, and is therefore incompatible with
> the current metastore.
> We can re-organize the JDO models to reduce space usage to keep the metastore
> scalable for the future. Currently, partitions are the fastest growing
> objects in the metastore, and the metastore keeps a separate copy of the
> columns list for each partition. We can normalize the metastore db by
> decoupling Columns from Storage Descriptors and not storing duplicate lists
> of the columns for each partition.
> An idea is to create an additional level of indirection with a "Column
> Descriptor" that has a list of columns. A table has a reference to its
> latest Column Descriptor (note: a table may have more than one Column
> Descriptor in the case of schema evolution). Partitions and Indexes can
> reference the same Column Descriptors as their parent table.
> Currently, the COLUMNS table in the metastore has roughly (number of
> partitions + number of tables) * (average number of columns per table) rows.
> We can reduce this to (number of tables) * (average number of columns per
> table) rows, while incurring a small cost proportional to the number of
> tables to store the Column Descriptors.
> Please see the latest review board for additional implementation details.
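
For a rough sense of scale, the row-count estimate in the issue description can be worked through as plain arithmetic. The sketch below is illustrative only: the function names and the sample figures (1,000 tables, 1,000,000 partitions, 20 columns per table) are assumptions chosen for the example, not measurements from any real metastore, and the "after" figure assumes one Column Descriptor per table (no schema evolution).

```python
def columns_rows_before(num_tables, num_partitions, avg_cols_per_table):
    """Approximate COLUMNS rows when every partition stores its own column list."""
    return (num_partitions + num_tables) * avg_cols_per_table


def columns_rows_after(num_tables, avg_cols_per_table):
    """Approximate column rows when partitions share their table's Column Descriptor.

    Assumes exactly one Column Descriptor per table (hypothetical simplification).
    """
    return num_tables * avg_cols_per_table


if __name__ == "__main__":
    # Hypothetical sample sizes, chosen only to illustrate the formula.
    tables, partitions, cols = 1_000, 1_000_000, 20

    before = columns_rows_before(tables, partitions, cols)
    after = columns_rows_after(tables, cols)

    print(before)          # 20020000
    print(after)           # 20000
    print(before / after)  # 1001.0
```

Under these assumed numbers the column rows shrink by a factor of about (partitions + tables) / tables, plus roughly one extra Column Descriptor row per table.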