stuhood commented on code in PR #22207:
URL: https://github.com/apache/datafusion/pull/22207#discussion_r3261654831


##########
datafusion/physical-expr/src/partitioning.rs:
##########
@@ -133,13 +136,225 @@ impl Display for Partitioning {
                     .join(", ");
                 write!(f, "Hash([{phy_exprs_str}], {size})")
             }
+            Partitioning::Range(range) => write!(f, "{range}"),
             Partitioning::UnknownPartitioning(size) => {
                 write!(f, "UnknownPartitioning({size})")
             }
         }
     }
 }
 
+/// Physical range partitioning.
+///
+/// [`RangePartitioning`] describes range bounds over one or more physical
+/// expressions. Each [`RangePartition`] represents one output partition and 
must
+/// contain exactly one [`RangeInterval`] for each partition expression.
+///
+/// The source declaring this partitioning is responsible for ensuring that, 
for
+/// every emitted row, the row belongs to exactly one partition and is emitted 
by
+/// that partition. The declared ranges do not need to cover values that the 
plan
+/// cannot emit.
+///
+/// Each lower and upper bound explicitly records whether it is inclusive.
+/// Unbounded sides are represented with `None`, bound values should be 
non-null
+/// until null routing semantics are defined.
+///
+/// For example, a scan can declare date and city range partitions as:
+///
+/// ```text
+/// exprs = [date, city]
+///
+/// partition 0:
+///   date in [2021-01-01, 2022-01-01)
+///   city in [Allston, Boston)
+///
+/// partition 1:
+///   date in [2021-01-01, 2022-01-01)
+///   city in [Boston, NYC)
+/// ```

Review Comment:
   Ok, interesting. Yea, if there are multiple consumers who are interested in 
multi-dimensional partitioning, and it can still reduce down to a base-case of 
single-dimension partitioning for consumers who don't need that complexity, 
then perhaps it could make sense to bake it in here.
   
   I'll be honest though: my largest concern is just that I have no experience 
with multi: only single. So I have less useful feedback to give.
   
   One thing that could likely be a good exercise in terms of the 
representation would be figuring out what datastructure you would/could use to 
efficiently partition in multiple dimensions, and then bias towards a 
representation which allows you to construct that datastructure. In one 
dimensional partitioning, that's essentially just a binary-tree/b-tree/sorted 
structure: hence the desire for non-overlapping contiguous ranges (to avoid 
needing something more complex like an interval tree). For multi-dimensional 
partitioning, what structure would you use, and what would the inputs to 
construct one be? I expect that fully covering the space makes that cheaper as 
well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to