GitHub user matt-martin closed a discussion: Issues with repartition

Hello,

I'm fairly new to Rust and DataFusion, so please excuse the basic question. I'm
trying to understand why `repartition` does not always produce the requested
number of partitions. Here's a very basic test I constructed:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn repartition_test() {
        let data_vec = (1..200).map(|x| x.to_string()).collect::<Vec<_>>();

        let batch = RecordBatch::try_new(
            Arc::new(Schema::new(vec![Field::new("foo", DataType::Utf8, false)])),
            vec![Arc::new(StringArray::from(data_vec))]
        ).unwrap();

        let result = SessionContext::new().read_batch(batch)
            .unwrap()
            .repartition(Partitioning::Hash(vec![col("foo")], 3))
            .unwrap()
            .collect_partitioned()
            .await
            .unwrap();

        print!("RESULTS look like: {:?}", result);
        assert_eq!(result.len(), 3);
    }
}
```

If I run the test:

```sh
cargo test -- tests::repartition_test --exact
```

I see the following output:

```sh
running 1 test
test tests::repartition_test ... FAILED

failures:

---- tests::repartition_test stdout ----
RESULTS look like: [[RecordBatch { schema: Schema { fields: [Field { name: "foo", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, columns: [StringArray
[
  "1",
  "2",
  "3",
  "4",
  "5",
  "6",
  "7",
  "8",
  "9",
  "10",
  ...179 elements...,
  "190",
  "191",
  "192",
  "193",
  "194",
  "195",
  "196",
  "197",
  "198",
  "199",
]], row_count: 199 }]]thread 'tests::repartition_test' panicked at src/main.rs:808:9:
assertion `left == right` failed
  left: 1
 right: 3
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    tests::repartition_test

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.02s
```

I'm probably missing something obvious, but shouldn't the top-level vector
returned by `collect_partitioned` have 3 elements (i.e. one for each partition)?
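
To make the expectation concrete, here is a minimal standard-library sketch of what I assumed hash partitioning into 3 partitions would do to the same 199 values. The helper `hash_partition` is hypothetical and uses Rust's `DefaultHasher`; it is not DataFusion's implementation (which presumably uses its own hash function on Arrow arrays), only an illustration of the shape I expected back.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: assign each value to one of `n` partitions
/// by hashing it and taking the hash modulo `n`.
/// Illustration only; this is not how DataFusion is implemented.
fn hash_partition(values: &[String], n: usize) -> Vec<Vec<String>> {
    // One (possibly empty) output vector per requested partition.
    let mut partitions = vec![Vec::new(); n];
    for v in values {
        let mut hasher = DefaultHasher::new();
        v.hash(&mut hasher);
        let idx = (hasher.finish() % n as u64) as usize;
        partitions[idx].push(v.clone());
    }
    partitions
}

fn main() {
    // Same input as the test above: the strings "1" through "199".
    let data: Vec<String> = (1..200).map(|x| x.to_string()).collect();
    let parts = hash_partition(&data, 3);

    let sizes: Vec<usize> = parts.iter().map(Vec::len).collect();
    println!("partition sizes: {:?}", sizes);

    // The outer vector has one entry per partition, and no rows are lost.
    assert_eq!(parts.len(), 3);
    assert_eq!(sizes.iter().sum::<usize>(), 199);
}
```

In this sketch the outer vector always has 3 entries, which is what I expected `result.len()` to be in the test.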

GitHub link: https://github.com/apache/datafusion/discussions/9701
