[ https://issues.apache.org/jira/browse/KUDU-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848014#comment-17848014 ]
Alexey Serbin edited comment on KUDU-3577 at 5/21/24 12:46 AM:
---------------------------------------------------------------

The root cause of the issue is storing the range partition information using {{RowOperationsPB}}:
{noformat}
message RangeWithHashSchemaPB {
  // Row operations containing the lower and upper range bound for the range.
  optional RowOperationsPB range_bounds = 1;
  // Hash schema for the range.
  repeated HashBucketSchemaPB hash_schema = 2;
}
{noformat}
The problem is that the range partition boundary information is stored in a serialized format that depends on the current schema of the table, since the serialized data looks like this:
{noformat}
[type_of_range_boundary][columns_set_bitmap][non_null_bitmap][encoded_range_key]
{noformat}
Essentially, the sizes of the 'columns_set_bitmap' and 'non_null_bitmap' fields depend on the total number of columns in the table and on the number of nullable columns in the table, respectively. Moreover, the latter field is absent altogether if the table has no nullable columns -- that's exactly the case exposed by the reproduction scenario.

The information should have been encoded independently of the table schema, similar to how the tablet's start/end ranges are encoded and stored. Alternatively, a primary-key-only sub-schema should have been used to encode the range boundaries in the field of the {{RowOperationsPB}} type: since the primary key of a table is immutable after its creation, the serialized representation wouldn't change under any allowable ALTER operation on the table. Since the feature is already released and we don't control the deployment of Kudu clients that use the original way of encoding/decoding the data, this adds compatibility constraints.
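To illustrate why the encoding breaks, here is a simplified model (a hypothetical sketch, not Kudu source code; the helper names are illustrative) of how the size of the schema-dependent bitmap prefix changes when columns are dropped, following the description above:

```python
def bitmap_bytes(num_bits: int) -> int:
    """One bit per column, rounded up to whole bytes."""
    return (num_bits + 7) // 8


def encoded_prefix_len(num_columns: int, num_nullable: int) -> int:
    """Length in bytes of the [columns_set_bitmap][non_null_bitmap] prefix.

    Per the description above, the non-null bitmap is omitted entirely
    when the table has no nullable columns -- the case hit by the
    reproduction scenario.
    """
    columns_set = bitmap_bytes(num_columns)
    non_null = bitmap_bytes(num_nullable) if num_nullable > 0 else 0
    return columns_set + non_null


# Table from the repro: id, name, age (NOT NULL) plus a nullable city column.
before = encoded_prefix_len(num_columns=4, num_nullable=1)  # 2 bytes
# After `kudu table delete_column test city` there are no nullable columns,
# so the non-null bitmap disappears from the encoding:
after = encoded_prefix_len(num_columns=3, num_nullable=0)   # 1 byte
assert before != after
```

Range bounds serialized under the old schema are then parsed with the wrong prefix length under the new schema, which is why the stored boundary type byte is misread as {{UNKNOWN}}.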
The following approach preserves backward compatibility (while it isn't optimal from the performance standpoint):
# Upon processing ALTER table operations that add or drop columns, check whether the sizes of the columns-set bitmap and the non-null bitmap change after applying the ALTER operation to the table.
# If the size of either bitmap changes, re-encode the information stored as {{PartitionSchemaPB::custom_hash_schema_ranges}} in the system catalog table for the partition ranges of the affected table.

The backwards-compatible approach above might still leave a gap when two clients work with the same table and at least one of them alters the table by dropping/adding columns, but at least it's better than the current state, where the table becomes inaccessible because its schema information is effectively corrupted under the conditions described above.
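The re-encoding trigger in the two steps above can be sketched as follows (a hypothetical sketch with illustrative names, not Kudu source code; it reuses the simplified bitmap-size model from the description above):

```python
from dataclasses import dataclass


@dataclass
class SchemaShape:
    """The two schema properties the serialized prefix depends on."""
    num_columns: int
    num_nullable: int


def bitmap_bytes(num_bits: int) -> int:
    return (num_bits + 7) // 8


def needs_reencode(old: SchemaShape, new: SchemaShape) -> bool:
    """True if stored range bounds would be parsed differently under the
    new schema, i.e. the size of either bitmap changes."""
    old_set = bitmap_bytes(old.num_columns)
    new_set = bitmap_bytes(new.num_columns)
    # The non-null bitmap is present only when the table has nullable columns.
    old_nn = bitmap_bytes(old.num_nullable) if old.num_nullable > 0 else 0
    new_nn = bitmap_bytes(new.num_nullable) if new.num_nullable > 0 else 0
    return old_set != new_set or old_nn != new_nn


# Dropping the only nullable column (the repro scenario) triggers a re-encode:
assert needs_reencode(SchemaShape(4, 1), SchemaShape(3, 0))
# Dropping a column without changing either bitmap's width does not:
assert not needs_reencode(SchemaShape(4, 1), SchemaShape(3, 1))
```

Performing this check at ALTER time keeps the on-disk {{custom_hash_schema_ranges}} entries consistent with whatever schema the table currently has, at the cost of re-encoding work on some ALTER operations.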
> Dropping a nullable column from a table with per-range hash partitions make
> the table unusable
> ----------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3577
>                 URL: https://issues.apache.org/jira/browse/KUDU-3577
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, master, tserver
>    Affects Versions: 1.17.0
>            Reporter: Alexey Serbin
>            Priority: Major
>
> For particular table schemas with per-range hash schemas, dropping a nullable
> column might make the table unusable. A workaround exists: just add the
> dropped column back using the {{kudu table add_column}} CLI tool. For
> example, for the reproduction scenario below, use the following command to
> restore access to the table's data:
> {noformat}
> $ kudu table add_column $M test city string
> {noformat}
> As for the reproduction scenario, see below for the sequence of {{kudu}} CLI
> commands.
> Set an environment variable for the Kudu cluster's RPC endpoint:
> {noformat}
> $ export M=<master_RPC_address(es)>
> {noformat}
> Create a table with two range partitions. It's crucial that the {{city}}
> column is nullable.
> {noformat}
> $ kudu table create $M '{ "table_name": "test", "schema": { "columns": [ { "column_name": "id", "column_type": "INT64" }, { "column_name": "name", "column_type": "STRING" }, { "column_name": "age", "column_type": "INT32" }, { "column_name": "city", "column_type": "STRING", "is_nullable": true } ], "key_column_names": ["id", "name", "age"] }, "partition": { "hash_partitions": [ {"columns": ["id"], "num_buckets": 4, "seed": 1}, {"columns": ["name"], "num_buckets": 4, "seed": 2} ], "range_partition": { "columns": ["age"], "range_bounds": [ { "lower_bound": {"bound_type": "inclusive", "bound_values": ["30"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["60"]} }, { "lower_bound": {"bound_type": "inclusive", "bound_values": ["60"]}, "upper_bound": {"bound_type": "exclusive", "bound_values": ["90"]} } ] } }, "num_replicas": 1 }'
> {noformat}
> Add an extra range partition with a custom hash schema:
> {noformat}
> $ kudu table add_range_partition $M test '[90]' '[120]' --hash_schema '{"hash_schema": [ {"columns": ["id"], "num_buckets": 3, "seed": 5}, {"columns": ["name"], "num_buckets": 3, "seed": 6} ]}'
> {noformat}
> Check the updated partitioning info:
> {noformat}
> $ kudu table describe $M test
> TABLE test (
>     id INT64 NOT NULL,
>     name STRING NOT NULL,
>     age INT32 NOT NULL,
>     city STRING NULLABLE,
>     PRIMARY KEY (id, name, age)
> )
> HASH (id) PARTITIONS 4 SEED 1,
> HASH (name) PARTITIONS 4 SEED 2,
> RANGE (age) (
>     PARTITION 30 <= VALUES < 60,
>     PARTITION 60 <= VALUES < 90,
>     PARTITION 90 <= VALUES < 120 HASH(id) PARTITIONS 3 HASH(name) PARTITIONS 3
> )
> OWNER root
> REPLICAS 1
> COMMENT
> {noformat}
> Drop the {{city}} column:
> {noformat}
> $ kudu table delete_column $M test city
> {noformat}
> Now try to run {{kudu table describe}} against the table once the {{city}} column is dropped.
> It errors out with {{Invalid argument}}:
> {noformat}
> $ kudu table describe $M test
> Invalid argument: Invalid split row type UNKNOWN
> {noformat}
> A similar issue manifests itself when trying to run {{kudu table scan}} against the table:
> {noformat}
> $ kudu table scan $M test
> Invalid argument: Invalid split row type UNKNOWN
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)