Re: [PR] perf: Introduce sort prefix computation for early TopK exit optimization on partially sorted input (10x speedup on top10 bench) [datafusion]

via GitHub Sat, 05 Apr 2025 08:03:04 -0700


geoffreyclaude commented on code in PR #15563:
URL: https://github.com/apache/datafusion/pull/15563#discussion_r2029892886



##########
datafusion/physical-plan/src/topk/mod.rs:
##########
@@ -681,4 +793,98 @@ mod tests {
         record_batch_store.unuse(0);
         assert_eq!(record_batch_store.batches_size, 0);
     }
+
+    /// This test validates that the `try_finish` method marks the TopK 
operator as finished
+    /// when the prefix (on column "a") of the last row in the current batch 
is strictly greater
+    /// than the worst top‑k row.
+    /// The full sort expression is defined on both columns ("a", "b"), but 
the input ordering is only on "a".
+    #[tokio::test]
+    async fn test_try_finish_marks_finished_with_prefix() -> Result<()> {
+        // Create a schema with two columns.
+        let schema = Arc::new(Schema::new(vec![
+            Field::new("a", DataType::Int32, false),
+            Field::new("b", DataType::Float64, false),
+        ]));
+
+        // Create sort expressions.
+        // Full sort: first by "a", then by "b".
+        let sort_expr_a = PhysicalSortExpr {
+            expr: col("a", schema.as_ref())?,
+            options: SortOptions::default(),
+        };
+        let sort_expr_b = PhysicalSortExpr {
+            expr: col("b", schema.as_ref())?,
+            options: SortOptions::default(),
+        };
+
+        // Input ordering uses only column "a" (a prefix of the full sort).
+        let input_ordering = LexOrdering::from(vec![sort_expr_a.clone()]);
+        let full_expr = LexOrdering::from(vec![sort_expr_a, sort_expr_b]);
+
+        // Create a dummy runtime environment and metrics.
+        let runtime = Arc::new(RuntimeEnv::default());
+        let metrics = ExecutionPlanMetricsSet::new();
+
+        // Create a TopK instance with k = 3 and batch_size = 2.
+        let mut topk = TopK::try_new(
+            0,
+            Arc::clone(&schema),
+            input_ordering,
+            full_expr,
+            3,
+            2,
+            runtime,
+            &metrics,
+        )?;
+
+        // Create the first batch with two columns:
+        // Column "a": [1, 1, 2], Column "b": [10.0, 20.0, 10.0].

Review Comment:
   Nice catch, I had updated the `b` in the comment but not in the code... I 
think what we really want is the "comment" data:
   
   ```
   First batch:
    a | b
   ---+---
    1 | 10.0
    1 | 20.0
    2 | 10.0
   
   Second batch:
    a | b
   ---+---
    2 | 30.0
    3 | 10.0
   ```
   
   Like that in the top3 we get the first two rows from the first batch and the 
third row from the second batch. wdyt?
   I can also swap the first two rows of the first batch to enforce sorting as 
you suggested: `(1,20.0), (1, 10.0), (2, 10.0)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] perf: Introduce sort prefix computation for early TopK exit optimization on partially sorted input (10x speedup on top10 bench) [datafusion]

Reply via email to