jayzhan211 commented on issue #11268:
URL: https://github.com/apache/datafusion/issues/11268#issuecomment-2212091914
> > `make_map_batch` version is way faster, we should go with that one.
> > Upd: I tried to remove clone in MapBuilder version, it still seems
slower than manually constructed one
> > Here's the revised version of the text with improved grammar:
>
> Thanks for your effort. I think `make_map_batch` is faster because it
accepts two known-type arrays. We only need to rearrange the layout for the
`MapArray`, which requires some `Arc` cloning. It's a good way to implement the
DuckDB syntax:
>
> ```
> SELECT MAP(['a', 'b', 'c'], [1, 2, 3]);
> ```
>
> However, I found it hard to provide a function like `make_map(k1, v1, k2,
v2 ...)` that is both efficient and simple. I tried two solutions for it:
>
> 1. Aggregate the keys and values, then pass them to `make_map_batch`:
> ```rust
> fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
> let mut key_buffer = VecDeque::new();
> let mut value_buffer = VecDeque::new();
>
> args.chunks_exact(2)
> .enumerate()
> .for_each(|(_, chunk)| {
> let key = chunk[0].clone();
> let value = chunk[1].clone();
> key_buffer.push_back(key);
> value_buffer.push_back(value);
> });
>
> let key: Vec<_> = key_buffer.into();
> let value: Vec<_> = value_buffer.into();
>
> let key = ColumnarValue::values_to_arrays(&key)?;
> let value = ColumnarValue::values_to_arrays(&value)?;
>
> make_map_batch(&[key[0].clone(), value[0].clone()])
> }
> ```
>
>
>
>
>
>
>
>
>
>
>
> The codebase is simple, but the performance is terrible. The bottleneck
is `ColumnarValue::values_to_arrays`. This function is convenient but slow.
Actually, this version is slower than the `MapBuilder` one. :(
> 2. MapBuilder:
> It's the faster solution (compared to the above one) for `make_map(k1,
v1, k2, v2 ...)`, but the codebase would become complex and large. In my
testing code, I hardcoded the types of keys and values to get the value and
create the builder. In a real scenario, we would need to match all types and
builders to use MapBuilder. I'm not sure if it's a good approach.
> ```rust
> fn make_map(args: &[ColumnarValue]) -> Result<ColumnarValue> {
> let string_builder = StringBuilder::new();
> let int_builder = Int32Builder::with_capacity(args.len() / 2);
> let mut builder =
> MapBuilder::new(None, string_builder, int_builder);
>
> for (_, chunk) in args.chunks_exact(2).enumerate() {
> let key = chunk[0].clone();
> let value = chunk[1].clone();
> match key {
> ColumnarValue::Scalar(ScalarValue::Utf8(Some(key_scalar)))
=> {
> builder.keys().append_value(key_scalar)
> }
> _ => {}
> }
>
> match value {
>
ColumnarValue::Scalar(ScalarValue::Int32(Some(value_scalar))) => {
> builder.values().append_value(value_scalar)
> }
> _ => {}
> }
> }
>
> Ok(ColumnarValue::Array(Arc::new(builder.finish())))
> }
> ```
>
> In conclusion, I will provide two functions:
>
> * For performance purposes: `map(['a', 'b', 'c'], [1, 2, 3])`
> * For user-friendly purposes: `make_map(k1, v1, k2, v2 ...)`
>
> For the user-friendly one, I'll choose the first solution to keep the
codebase simpler. I found that `named_struct` also uses
`ColumnarValue::values_to_arrays` to do something similar.
>
>
https://github.com/apache/datafusion/blob/08c5345e932f1c5c948751e0d06b1fd99e174efa/datafusion/functions/src/core/named_struct.rs#L60
>
> We could suggest that users who intend to get better performance or create
a large map use the `map()` function.
>
> Some appendixs for the benchmark result:
>
> ```
> // first run for warm-up
> Generating 1 random key-value pairs
> Time elapsed in make_map() is: 391.618µs
> Time elapsed in make_map_from() is: 224.568µs
> Time elapsed in make_map_batch() is: 18.173µs
>
> Generating 2 random key-value pairs
> Time elapsed in make_map() is: 19.848µs
> Time elapsed in make_map_from() is: 27.726µs
> Time elapsed in make_map_batch() is: 32.628µs
>
> Generating 4 random key-value pairs
> Time elapsed in make_map() is: 18.495µs
> Time elapsed in make_map_from() is: 37.102µs
> Time elapsed in make_map_batch() is: 16.399µs
>
> Generating 8 random key-value pairs
> Time elapsed in make_map() is: 20.785µs
> Time elapsed in make_map_from() is: 48.596µs
> Time elapsed in make_map_batch() is: 42.98µs
>
> Generating 10 random key-value pairs
> Time elapsed in make_map() is: 26.3µs
> Time elapsed in make_map_from() is: 84.601µs
> Time elapsed in make_map_batch() is: 15.55µs
>
> // it's not so common for a map use case
> Generating 50 random key-value pairs
> Time elapsed in make_map() is: 35.751µs
> Time elapsed in make_map_from() is: 332.336µs
> Time elapsed in make_map_batch() is: 52.425µs
>
> Generating 100 random key-value pairs
> Time elapsed in make_map() is: 178.775µs
> Time elapsed in make_map_from() is: 508.017µs
> Time elapsed in make_map_batch() is: 184.452µs
>
> Generating 1000 random key-value pairs
> Time elapsed in make_map() is: 386.535µs
> Time elapsed in make_map_from() is: 3.125679ms
> Time elapsed in make_map_batch() is: 45.549µs
> ```
what is your code for benchmarking?
Given the code, it seems you clone ColumnValue and push to VecDeque, we can
avoid it. Also, I think we can use Vec instead of VecDeque, since we don;t
need `push_front`.
```rust
fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
let mut key_buffer = VecDeque::new();
let mut value_buffer = VecDeque::new();
args.chunks_exact(2)
.enumerate()
.for_each(|(_, chunk)| {
let key = chunk[0].clone();
let value = chunk[1].clone();
key_buffer.push_back(key);
value_buffer.push_back(value);
});
let key: Vec<_> = key_buffer.into();
let value: Vec<_> = value_buffer.into();
let key = ColumnarValue::values_to_arrays(&key)?;
let value = ColumnarValue::values_to_arrays(&value)?;
make_map_batch(&[key[0].clone(), value[0].clone()])
}
```
You only take the first element, but shouldn't we process all the kv?
```rust
make_map_batch(&[key[0].clone(), value[0].clone()])
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]