kosiew commented on code in PR #16340:
URL: https://github.com/apache/datafusion/pull/16340#discussion_r2139532304

##########
docs/source/library-user-guide/table-constraints.md:
##########
@@ -0,0 +1,46 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Table Constraint Enforcement
+
+Table providers can describe table constraints using the
+[`TableConstraint`] and [`Constraints`] APIs. These constraints include
+primary keys, unique keys, foreign keys and check constraints.
+
+DataFusion does **not** currently enforce these constraints at runtime.
+They are provided for informational purposes and can be used by custom
+`TableProvider` implementations or other parts of the system.
+
+- **Nullability**: The only property enforced by DataFusion is the
+  nullability of each [`Field`] in a schema. Columns marked as not
+  nullable should not produce null values during execution. DataFusion
+  does not check this when data is ingested.
+- **Primary and unique keys**: DataFusion does not verify that the data
+  satisfies primary or unique key constraints. Table providers that
+  require this behaviour must implement their own checks.
+- **Foreign keys and check constraints**: These constraints are parsed
+  but are not validated or used during query planning.
+
+The optimizer also does not assume that these constraints hold when
+rewriting queries. For example, declaring a column as a primary key will
+not allow the optimizer to skip a `DISTINCT` aggregation.

Review Comment:
Hi @alamb, you're right. I tested this in datafusion-cli:

```sql
-- Test 1: Create table with more data to see if DISTINCT appears
CREATE TABLE test_pk_large (
    id INTEGER PRIMARY KEY,
    name VARCHAR(50)
);

-- Insert duplicate names but unique IDs
INSERT INTO test_pk_large VALUES
    (1, 'Alice'),
    (2, 'Alice'),
    (3, 'Bob'),
    (4, 'Bob'),
    (5, 'Charlie');

-- Test DISTINCT on primary key column
EXPLAIN SELECT DISTINCT id FROM test_pk_large;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 376        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+

-- Test 2
CREATE TABLE test_no_pk (
    id INTEGER,
    name VARCHAR(50)
);

-- Insert unique IDs (same as before)
INSERT INTO test_no_pk VALUES
    (1, 'Alice'),
    (2, 'Alice'),
    (3, 'Bob'),
    (4, 'Bob'),
    (5, 'Charlie');

EXPLAIN SELECT DISTINCT id FROM test_no_pk;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │        group_by: id       │ |
|               | │                           │ |
|               | │           mode:           │ |
|               | │      FinalPartitioned     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          10 -> 10         │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │      Hash([id@0], 10)     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          1 -> 10          │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │    RoundRobinBatch(10)    │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │        group_by: id       │ |
|               | │       mode: Partial       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 376        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
```

In other words, the optimized plan does seem to depend on the declared constraints: with `id` declared as a primary key the plan contains no `AggregateExec`, while without the key the `DISTINCT` is planned as a two-phase aggregation.

I'll remove this paragraph.