[ https://issues.apache.org/jira/browse/HIVE-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13776952#comment-13776952 ]
Roshan Naik commented on HIVE-4196:
-----------------------------------

Thanks Ashutosh. Since your recommendations apply to subtask HIVE-5138, I have copied your comments over to it. I will address them there.

> Support for Streaming Partitions in Hive
> ----------------------------------------
>
> Key: HIVE-4196
> URL: https://issues.apache.org/jira/browse/HIVE-4196
> Project: Hive
> Issue Type: New Feature
> Components: Database/Schema, HCatalog
> Affects Versions: 0.10.1
> Reporter: Roshan Naik
> Assignee: Roshan Naik
> Attachments: HCatalogStreamingIngestFunctionalSpecificationandDesign- apr 29- patch1.docx, HCatalogStreamingIngestFunctionalSpecificationandDesign- apr 29- patch1.pdf, HIVE-4196.v1.patch
>
>
> Motivation: Allow Hive users to immediately query data streaming in through clients such as Flume.
> Currently Hive partitions must be created after all the data for the partition is available. Thereafter, data in the partitions is considered immutable.
> This proposal introduces the notion of a streaming partition into which new files can be committed periodically and made available for queries before the partition is closed and converted into a standard partition.
> The admin enables streaming partitions on a table using DDL, providing the following pieces of information:
> - Name of the partition in the table on which streaming is enabled
> - Frequency at which the streaming partition should be closed and converted into a standard partition.
> Tables with streaming partitions enabled will be partitioned by one and only one column. It is assumed that this column will contain a timestamp.
> Closing the current streaming partition converts it into a standard partition. Based on the specified frequency, the current streaming partition is closed and a new one created for future writes. This is referred to as 'rolling the partition'.
> A streaming partition's life cycle is as follows:
> - A new streaming partition is instantiated for writes.
> - Streaming clients request (via webhcat) an HDFS file name into which they can write a chunk of records for a specific table.
> - Streaming clients write a chunk (via webhdfs) to that file and commit it (via webhcat). Committing merely indicates that the chunk has been written completely and is ready for serving queries.
> - When the partition is rolled, all committed chunks are swept into a single directory and a standard partition pointing to that directory is created. The streaming partition is closed and a new streaming partition is created. Rolling the partition is atomic. Streaming clients are agnostic of partition rolling.
> - Hive queries will be able to query the partition that is currently open for streaming. Only committed chunks will be visible. Read consistency will be ensured so that repeated reads of the same partition will be idempotent for the lifespan of the query.
> Partition rolling requires an active agent/thread running to check when it is time to roll and trigger the roll. This could be achieved either by using an external agent such as Oozie (preferably) or by an internal agent.
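
The life cycle described above can be sketched as a toy in-memory model. This is only an illustration, not the code in HIVE-4196.v1.patch: the class `StreamingTable`, its method names, the chunk naming scheme, and the hourly partition names are all hypothetical, and leaving uncommitted chunks pending across a roll is an assumption that the description does not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StreamingTable:
    """Toy model of a table with one open streaming partition (hypothetical)."""
    open_name: str                                       # currently open streaming partition
    pending: List[str] = field(default_factory=list)     # handed out, not yet committed
    committed: List[str] = field(default_factory=list)   # visible to queries
    standard: Dict[str, Tuple[str, ...]] = field(default_factory=dict)
    _next_id: int = 0

    def request_chunk(self) -> str:
        # Stand-in for a client asking (via webhcat) for a file name to write to.
        chunk = f"chunk-{self._next_id:05d}"
        self._next_id += 1
        self.pending.append(chunk)
        return chunk

    def commit(self, chunk: str) -> None:
        # Committing only marks the chunk as completely written.
        self.pending.remove(chunk)  # raises ValueError for an unknown chunk
        self.committed.append(chunk)

    def snapshot(self) -> Tuple[str, ...]:
        # A query captures the committed list once, so repeated reads
        # within that query see the same set of chunks.
        return tuple(self.committed)

    def roll(self, next_name: str) -> None:
        # Sweep committed chunks into a standard partition and open a new
        # streaming partition. Assumption: pending chunks stay pending.
        self.standard[self.open_name] = tuple(self.committed)
        self.committed = []
        self.open_name = next_name


table = StreamingTable(open_name="2013-09-24-10")
a = table.request_chunk()
b = table.request_chunk()
table.commit(a)
view = table.snapshot()      # a query starting now sees only chunk a
table.commit(b)              # does not change the snapshot taken above
table.roll("2013-09-24-11")  # both committed chunks move to a standard partition
```

In a real implementation the roll would have to be atomic with respect to concurrent commits and queries; this single-threaded model sidesteps that entirely.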