duongcongtoai commented on issue #17446:
URL: https://github.com/apache/datafusion/issues/17446#issuecomment-3366427064
```
Benchmark 1: uv run polar.py sample-1m.parquet
Time (mean ± σ): 258.6 ms ± 32.2 ms [User: 514.8 ms, System: 218.1 ms]
Range (min … max): 238.5 ms … 348.3 ms 10 runs
Warning: The first benchmarking run for this command was significantly
slower than the rest (348.3 ms). This could be caused by (filesystem) caches
that were not filled until after the first run. You should consider using the
'--warmup' option to fill those caches before the actual benchmark.
Alternatively, use the '--prepare' option to clear the caches before each
timing run.
Benchmark 2: uv run df.py sample-1m.parquet
Time (mean ± σ): 345.3 ms ± 8.4 ms [User: 2194.9 ms, System: 241.7 ms]
Range (min … max): 331.7 ms … 360.7 ms 10 runs
Summary
uv run polar.py sample-1m.parquet ran
1.34 ± 0.17 times faster than uv run df.py sample-1m.parquet
```
There was a significant improvement after using `interleave`, but there is
still an overflow error when writing to Parquet. I'm fixing that and will push
it for review soon.
```
thread 'tokio-runtime-worker' (2611103) panicked at /home/toai/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/arrow-data-56.1.0/src/transform/mod.rs:676:31:
MutableArrayData::new is infallible: DictionaryKeyOverflowError
stack backtrace:
0: __rustc::rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: arrow_data::transform::MutableArrayData::with_capacities
4: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
5: arrow_data::transform::MutableArrayData::with_capacities
6: arrow_data::transform::MutableArrayData::with_capacities
7: arrow_select::interleave::interleave_fallback
8: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
9: datafusion_physical_plan::sorts::builder::BatchBuilder::build_record_batch
10: <datafusion_physical_plan::sorts::merge::SortPreservingMergeStream<C> as futures_core::stream::Stream>::poll_next
11: datafusion_common_runtime::trace_utils::trace_future::{{closure}}
12: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
```
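One plausible reading of the backtrace above is that the fallback path concatenates the dictionaries of all input arrays and shifts each key by an offset, so the combined dictionary can outgrow the key type even when the number of distinct values would fit. The sketch below is a toy model of that idea using only the standard library. `DictArray`, `KeyOverflow`, `concat_dictionaries`, `merge_dictionaries`, and `sample_batch` are hypothetical names made up for illustration, not arrow-rs APIs, and the 8-bit key width is an assumption chosen to keep the example small.

```rust
use std::collections::HashMap;

/// Toy stand-in for a dictionary-encoded string array with 8-bit keys.
/// Hypothetical type for illustration only; not an arrow-rs API.
#[derive(Debug, Clone)]
struct DictArray {
    values: Vec<String>, // dictionary values
    keys: Vec<u8>,       // one index into `values` per row
}

/// Hypothetical error marker, loosely mirroring a dictionary key overflow.
#[derive(Debug, PartialEq)]
struct KeyOverflow;

/// Naive merge: append every input dictionary unchanged and shift keys by an offset.
fn concat_dictionaries(inputs: &[DictArray]) -> Result<DictArray, KeyOverflow> {
    let mut values = Vec::new();
    let mut keys = Vec::new();
    for input in inputs {
        let offset = values.len();
        values.extend(input.values.iter().cloned());
        // A u8 key can address at most 256 dictionary entries.
        if values.len() > u8::MAX as usize + 1 {
            return Err(KeyOverflow);
        }
        for &k in &input.keys {
            keys.push((offset + k as usize) as u8);
        }
    }
    Ok(DictArray { values, keys })
}

/// Deduplicating merge: keep each distinct value once and remap the keys.
fn merge_dictionaries(inputs: &[DictArray]) -> Result<DictArray, KeyOverflow> {
    let mut values: Vec<String> = Vec::new();
    let mut index: HashMap<String, u8> = HashMap::new();
    let mut keys = Vec::new();
    for input in inputs {
        for &k in &input.keys {
            let v = &input.values[k as usize];
            let new_key = match index.get(v) {
                Some(&key) => key,
                None => {
                    // Only genuinely new values consume key space.
                    if values.len() > u8::MAX as usize {
                        return Err(KeyOverflow);
                    }
                    let key = values.len() as u8;
                    values.push(v.clone());
                    index.insert(v.clone(), key);
                    key
                }
            };
            keys.push(new_key);
        }
    }
    Ok(DictArray { values, keys })
}

/// One batch with 100 distinct values, each referenced once.
fn sample_batch() -> DictArray {
    let values: Vec<String> = (0..100).map(|i| format!("v{}", i)).collect();
    let keys: Vec<u8> = (0..100).collect();
    DictArray { values, keys }
}

fn main() {
    // Three batches sharing the same 100 values: 300 concatenated entries, 100 distinct.
    let inputs = vec![sample_batch(), sample_batch(), sample_batch()];
    println!("{:?}", concat_dictionaries(&inputs).map(|a| a.values.len()));
    println!("{:?}", merge_dictionaries(&inputs).map(|a| a.values.len()));
}
```

In this toy setup the concatenating merge fails while the deduplicating merge succeeds, which is the kind of gap a fix would need to close, either by merging dictionary values or by widening the key type. Whether this matches the actual failure here is an assumption.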
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]