[ https://issues.apache.org/jira/browse/KUDU-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425884#comment-17425884 ]
ASF subversion and git services commented on KUDU-2671:
-------------------------------------------------------

Commit a50091e2d4509feac2f29128107102ec52fcb7b0 in kudu's branch refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=a50091e ]

KUDU-2671 introduce PartitionKey

This changelist introduces a dedicated PartitionKey data structure to represent a table's partition key. The crux is keeping the hash and the range parts of the key separate, so the key can be handled correctly without the context of a particular range or tablet.

Before this patch, encoded partition keys were represented as strings with the hash and the range parts concatenated. It was always possible to decode such a key using only table-wide information because every range had the same hash schema. However, now that custom hash schemas per range have been introduced, it is impossible to decode such a compound key without knowing the particular hash schema that was used to encode it.

The ordering of PartitionKeys in this changelist follows the legacy comparison semantics: the hash and the range parts are first concatenated, and the resulting strings are then compared. This is going to change in a follow-up changelist, and that change induces corresponding changes in PartitionPruner. The updated comparison operator for PartitionKey also requires updating the code in the catalog manager and in the client's meta-cache. I decided to split the changes into a few changelists for easier tracking and reviewing.

As a by-product of this change, the following methods of the PartitionSchema class now work as intended for tables with per-range custom hash schemas:
 * PartitionKeyDebugStringImpl()
 * PartitionKeyDebugString()

I also considered introducing an extra field in ScanTokenPB, similar to the newly added GetTableLocationsRequestPB::partition_key_range field, or relying on the ScanTokenPB::tablet_metadata field to extract the hash and the range parts from the strings representing the lower and upper scan boundaries, but that deserves to be done in a separate changelist.

Change-Id: I00255ec404beeb999117f5265de0d5d8deaf0d68
Reviewed-on: http://gerrit.cloudera.org:8080/17890
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>


> Change hash number for range partitioning
> -----------------------------------------
>
>                 Key: KUDU-2671
>                 URL: https://issues.apache.org/jira/browse/KUDU-2671
>             Project: Kudu
>          Issue Type: Improvement
>          Components: client, java, master, server
>    Affects Versions: 1.8.0
>            Reporter: yangz
>            Assignee: Mahesh Reddy
>            Priority: Major
>              Labels: feature, roadmap-candidate, scalability
>         Attachments: 屏幕快照 2019-01-24 下午12.03.41.png
>
>
> For our usage, the Kudu schema design isn't flexible enough.
> We create our tables with day-range partitions such as dt='20181112', as with a Hive table.
> But our data size varies a lot from day to day: one day it may be 50 GB, another day 500 GB. That makes it hard to choose the hash schema. If the hash number is too big, it is wasteful for most days; if it is too small, there are performance problems on days with a large amount of data.
>
> So we suggest a solution: change the hash number based on a table's historical data. For example:
> # We create the schema with an estimated hash number.
> # We collect the data size for each day range.
> # We create the new day-range partition with a hash number derived from the collected size.
>
> We have used this feature for half a year, and it works well.
> We hope this feature will be useful for the community. Maybe the solution isn't complete; please help us make it better.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
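Below is a minimal, purely illustrative C++ sketch of the PartitionKey idea described in the commit message above: the hash-encoded part and the range-encoded part are kept as separate fields, and ordering follows the legacy semantics of concatenating the two parts and comparing the resulting strings. The class layout and member names are assumptions for illustration only, not Kudu's actual declarations.

{code:cpp}
#include <string>
#include <utility>

// Illustrative stand-in for the PartitionKey described above (not Kudu's
// actual class): the hash-encoded prefix and the range-encoded suffix are
// stored separately, so the key can later be interpreted against the hash
// schema of the particular range it belongs to.
class PartitionKey {
 public:
  PartitionKey(std::string hash_key, std::string range_key)
      : hash_key_(std::move(hash_key)), range_key_(std::move(range_key)) {}

  const std::string& hash_key() const { return hash_key_; }
  const std::string& range_key() const { return range_key_; }

  // Legacy ordering semantics: concatenate the hash and range parts and
  // compare the resulting strings (per the commit message, this ordering
  // is slated to change in a follow-up changelist).
  bool operator<(const PartitionKey& other) const {
    return hash_key_ + range_key_ < other.hash_key_ + other.range_key_;
  }

 private:
  std::string hash_key_;   // encoded hash-bucket components
  std::string range_key_;  // encoded range-column components
};
{code}

Keeping the two parts separate is what allows decoding and debug-printing a key once each range can carry its own hash schema, since the split between the hash prefix and the range suffix no longer has to be inferred from a single table-wide schema.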