Re: [PR] Minor: Add routine to debug join fuzz tests [datafusion]

via GitHub Tue, 18 Jun 2024 07:50:34 -0700


alamb commented on code in PR #10970:
URL: https://github.com/apache/datafusion/pull/10970#discussion_r1644601726



##########
datafusion/core/tests/fuzz_cases/join_fuzz.rs:
##########
@@ -383,7 +419,7 @@ impl JoinFuzzTestCase {
 
     /// Perform sort-merge join and hash join on same input
     /// and verify two outputs are equal
-    async fn run_test(&self) {
+    async fn run_test(&self, join_test: &[JoinTest], debug: bool) {

Review Comment:
   Since this is already on a struct, another way to pass in the `debug` flag 
would be as a new field on `JoinFuzzTestCase`
   
   
   like
   ```rust
   struct JoinFuzzTestCase {
       batch_sizes: &'static [usize],
       input1: Vec<RecordBatch>,
       input2: Vec<RecordBatch>,
       join_type: JoinType,
       join_filter_builder: Option<JoinFilterBuilder>,
       // if debug flag is set the test will save randomly generated inputs and 
outputs to user folders
       // so it is easy to run debug on top of the failed test data
       debug: bool,
   }
   ```
   



##########
datafusion/core/tests/fuzz_cases/join_fuzz.rs:
##########
@@ -394,17 +430,31 @@ impl JoinFuzzTestCase {
             let hj = self.hash_join();
             let hj_collected = collect(hj, task_ctx.clone()).await.unwrap();
 
+            let nlj = self.nested_loop_join();
+            let nlj_collected = collect(nlj, task_ctx.clone()).await.unwrap();
+
             // Get actual row counts(without formatting overhead) for HJ and 
SMJ
             let hj_rows = hj_collected.iter().fold(0, |acc, b| acc + 
b.num_rows());
             let smj_rows = smj_collected.iter().fold(0, |acc, b| acc + 
b.num_rows());
+            let nlj_rows = nlj_collected.iter().fold(0, |acc, b| acc + 
b.num_rows());
+
+            // if debug flag is set the test will save randomly generated 
inputs and outputs to user folders
+            // so it is easy to run debug on top of the failed test data
+            if debug {

Review Comment:
   👍 
   
   It might make sense to add this information about saving information to the 
overall docstring of this function (`run_tests`) so people don't have to read 
the implementation to find out what debug does.



##########
datafusion/core/tests/fuzz_cases/join_fuzz.rs:
##########
@@ -425,35 +475,106 @@ impl JoinFuzzTestCase {
                 nlj_formatted.trim().lines().collect();
             nlj_formatted_sorted.sort_unstable();
 
-            // row level compare if any of joins returns the result
-            // the reason is different formatting when there is no rows
-            if smj_rows > 0 || hj_rows > 0 {
-                for (i, (smj_line, hj_line)) in smj_formatted_sorted
+            if join_test.contains(&JoinTest::NljHj) {
+                let err_msg_rowcnt = format!("NestedLoopJoinExec and 
HashJoinExec produced different row counts, batch_size: {}", batch_size);
+                assert_eq!(nlj_rows, hj_rows, "{}", err_msg_rowcnt.as_str());
+
+                let err_msg_contents = format!("NestedLoopJoinExec and 
HashJoinExec produced different results, batch_size: {}", batch_size);
+
+                // row level compare if any of joins returns the result
+                // the reason is different formatting when there is no rows
+                for (i, (nlj_line, hj_line)) in nlj_formatted_sorted
                     .iter()
                     .zip(&hj_formatted_sorted)
                     .enumerate()
                 {
                     assert_eq!(
-                        (i, smj_line),
+                        (i, nlj_line),
                         (i, hj_line),
-                        "SortMergeJoinExec and HashJoinExec produced different 
results"
+                        "{}",
+                        err_msg_contents.as_str()
                     );
                 }
             }
 
-            for (i, (nlj_line, hj_line)) in nlj_formatted_sorted
-                .iter()
-                .zip(&hj_formatted_sorted)
-                .enumerate()
-            {
-                assert_eq!(
-                    (i, nlj_line),
-                    (i, hj_line),
-                    "NestedLoopJoinExec and HashJoinExec produced different 
results"
-                );
+            if join_test.contains(&JoinTest::HjSmj) {
+                let err_msg_row_cnt = format!("HashJoinExec and 
SortMergeJoinExec produced different row counts, batch_size: {}", &batch_size);
+                assert_eq!(hj_rows, smj_rows, "{}", err_msg_row_cnt.as_str());
+
+                let err_msg_contents = format!("SortMergeJoinExec and 
HashJoinExec produced different results, batch_size: {}", &batch_size);
+
+                // row level compare if any of joins returns the result
+                // the reason is different formatting when there is no rows
+                if smj_rows > 0 || hj_rows > 0 {
+                    for (i, (smj_line, hj_line)) in smj_formatted_sorted
+                        .iter()
+                        .zip(&hj_formatted_sorted)
+                        .enumerate()
+                    {
+                        assert_eq!(
+                            (i, smj_line),
+                            (i, hj_line),
+                            "{}",
+                            err_msg_contents.as_str()
+                        );
+                    }
+                }
             }
         }
     }
+
+    /// This method useful for debugging fuzz tests
+    /// It helps to save randomly generated input test data for both join 
inputs into the user folder
+    /// as a parquet files preserving partitioning.
+    /// Once the data is saved it is possible to run a custom test on top of 
the saved data and debug

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Minor: Add routine to debug join fuzz tests [datafusion]

Reply via email to