Re: [PR] Fix constant window for evaluate stateful [datafusion]

via GitHub Wed, 18 Jun 2025 14:21:05 -0700


alamb commented on PR #16430:
URL: https://github.com/apache/datafusion/pull/16430#issuecomment-2985661485


   I tried to make a reproducer, but I could not reproduce the wrong results or panic reported in @andygrove's comment (https://github.com/apache/datafusion/issues/16308#issuecomment-2949516445).
   
   Here is what I tried:
   
   Data: [tenk.csv](https://github.com/user-attachments/files/20804065/tenk.csv)
   
   Repro
   ```sql
   create external table tenk1
   (
   unique1 int,
   unique2 int,
   two int,
   four int,
   ten int,
   twenty int,
   hundred int,
   thousand int,
   twothousand int,
   fivethous int,
   tenthous int,
   odd int,
   even int,
   stringu1 string,
   stringu2 string,
   string4 string
   )
   stored as CSV location 'tenk.csv'
   OPTIONS('has_header' 'false','format.delimiter' 9);
   
   SELECT * from tenk1 limit 10;
   
   SELECT COUNT(*) OVER () FROM tenk1 WHERE unique2 < 10
   ```
   
   But that seems to work just fine:
   
   ```shell
   (venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f repro.sql
   DataFusion CLI v48.0.0
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   
   +---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------+
   | unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 |
   +---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------+
   | 8800    | 0       | 0   | 0    | 0   | 0      | 0       | 800      | 800         | 3800      | 8800     | 0   | 1    | MAAAAA   | AAAAAA   | AAAAxx  |
   | 1891    | 1       | 1   | 3    | 1   | 11     | 91      | 891      | 1891        | 1891      | 1891     | 182 | 183  | TUAAAA   | BAAAAA   | HHHHxx  |
   | 3420    | 2       | 0   | 0    | 0   | 0      | 20      | 420      | 1420        | 3420      | 3420     | 40  | 41   | OBAAAA   | CAAAAA   | OOOOxx  |
   | 9850    | 3       | 0   | 2    | 0   | 10     | 50      | 850      | 1850        | 4850      | 9850     | 100 | 101  | WOAAAA   | DAAAAA   | VVVVxx  |
   | 7164    | 4       | 0   | 0    | 4   | 4      | 64      | 164      | 1164        | 2164      | 7164     | 128 | 129  | OPAAAA   | EAAAAA   | AAAAxx  |
   | 8009    | 5       | 1   | 1    | 9   | 9      | 9       | 9        | 9           | 3009      | 8009     | 18  | 19   | BWAAAA   | FAAAAA   | HHHHxx  |
   | 5057    | 6       | 1   | 1    | 7   | 17     | 57      | 57       | 1057        | 57        | 5057     | 114 | 115  | NMAAAA   | GAAAAA   | OOOOxx  |
   | 6701    | 7       | 1   | 1    | 1   | 1      | 1       | 701      | 701         | 1701      | 6701     | 2   | 3    | TXAAAA   | HAAAAA   | VVVVxx  |
   | 4321    | 8       | 1   | 1    | 1   | 1      | 21      | 321      | 321         | 4321      | 4321     | 42  | 43   | FKAAAA   | IAAAAA   | AAAAxx  |
   | 3043    | 9       | 1   | 3    | 3   | 3      | 43      | 43       | 1043        | 3043      | 3043     | 86  | 87   | BNAAAA   | JAAAAA   | HHHHxx  |
   +---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------+
   10 row(s) fetched.
   Elapsed 0.007 seconds.
   
   +-------------------------------------------------------------------+
   | count(*) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
   +-------------------------------------------------------------------+
   | 10                                                                |
   | 10                                                                |
   | 10                                                                |
   | 10                                                                |
   | 10                                                                |
   | 10                                                                |
   | 10                                                                |
   | 10                                                                |
   | 10                                                                |
   | 10                                                                |
   +-------------------------------------------------------------------+
   10 row(s) fetched.
   Elapsed 0.004 seconds.
   ```
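
As an independent cross-check of what the window query should return, here is a minimal sketch that runs the same statement against SQLite through Python's standard library. It does not exercise DataFusion at all. The synthetic table (a single `unique2` column holding 0..9999) and the helper name `count_over_empty_window` are assumptions made for illustration, not part of the original repro.

```python
import sqlite3


def count_over_empty_window(n_rows=10_000, cutoff=10):
    """Run COUNT(*) OVER () on a synthetic stand-in for tenk1.

    Assumption: unique2 holds the distinct values 0..n_rows-1, which
    matches the first ten rows shown in the output above.
    Window functions need SQLite 3.25 or newer.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE tenk1 (unique2 INTEGER)")
        conn.executemany(
            "INSERT INTO tenk1 VALUES (?)",
            ((i,) for i in range(n_rows)),
        )
        # An empty OVER () makes the frame the whole filtered result set,
        # so every surviving row carries the same total count.
        rows = conn.execute(
            "SELECT COUNT(*) OVER () FROM tenk1 WHERE unique2 < ?",
            (cutoff,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(count_over_empty_window())
```

With the default arguments this yields ten rows, each holding the value 10, which agrees with the `datafusion-cli` output above.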
   
   Notes to myself on where this query came from:
   
   
https://github.com/apache/spark/blob/a38d1cef73eda8ab765dc168284b9c113c237a8e/sql/core/src/test/resources/sql-tests/inputs/postgreSQL/window_part1.sql#L50
   
   ```sql
   SELECT COUNT(*) OVER () FROM tenk1 WHERE unique2 < 10
   ```
   
   I did some digging and found the table definition is
   
https://github.com/apache/spark/blob/a38d1cef73eda8ab765dc168284b9c113c237a8e/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala#L536-L562
   
   ```scala
       session
         .read
         .format("csv")
         .options(Map("delimiter" -> "\t", "header" -> "false"))
         .schema(
           """
             |unique1 int,
             |unique2 int,
             |two int,
             |four int,
             |ten int,
             |twenty int,
             |hundred int,
             |thousand int,
             |twothousand int,
             |fivethous int,
             |tenthous int,
             |odd int,
             |even int,
             |stringu1 string,
             |stringu2 string,
             |string4 string
           """.stripMargin)
         .load(testFile("test-data/postgresql/onek.data"))
   ```
   
   The data is here: https://github.com/apache/spark/blob/a38d1cef73eda8ab765dc168284b9c113c237a8e/sql/core/src/test/resources/test-data/postgresql/tenk.data
   
   


