GitHub user devoxi closed a discussion: How to improve Parquet reading 
performance?

Hello!

Over the last couple of days we've been experimenting with DataFusion (31.0) 
and comparing its performance with our existing ClickHouse setup. To do so, we 
exported a 5 GB Parquet dataset and, by building some UDFs, managed to 
replicate some of our queries.

In the end we are running a single, fairly simple query over the same Parquet 
dataset on the same Mac with both DataFusion and ClickHouse. ClickHouse 
consistently answers in about 700 ms, while DataFusion takes about 1.2 s.

I've tried multiple settings, verified our UDF wasn't the cause, and checked 
that ClickHouse wasn't serving results from a cache, but I couldn't make 
DataFusion any faster. According to EXPLAIN ANALYZE, the time is going into 
the Parquet scan phase.
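For reference, these are the kinds of session options we toggled while experimenting (option names as listed in DataFusion's configuration documentation; shown here as SQL `SET` statements, though the same options can be set on `SessionConfig` in Rust — the values below are just examples, not a recommended configuration):

```sql
-- Evaluate pushed-down predicates inside the Parquet scan
-- instead of in a separate FilterExec afterwards
SET datafusion.execution.parquet.pushdown_filters = true;

-- Allow DataFusion to reorder pushed-down predicates by estimated cost
SET datafusion.execution.parquet.reorder_filters = true;

-- Number of partitions scanned in parallel (defaults to the core count)
SET datafusion.execution.target_partitions = 20;
```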

I have to confess that we are Rust beginners and might have missed something, 
hence this message.
Here is the EXPLAIN ANALYZE output of our query, in case it helps:
```
Plan with Metrics | ProjectionExec: expr=[SUM(my_table.sign)@0 as tcount, 
SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * 
my_table.sign)@1 as _ccount_1], metrics=[output_rows=1, elapsed_compute=584ns]
     AggregateExec: mode=Final, gby=[], aggr=[SUM(my_table.sign)@0 as tcount, 
SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * 
my_table.sign)], metrics=[output_rows=1, elapsed_compute=48.459µs]
        CoalescePartitionsExec, metrics=[output_rows=20, 
elapsed_compute=5.708µs]
            AggregateExec: mode=Partial, gby=[], aggr=[SUM(my_table.sign)@0 as 
tcount, SUM(my_udf(my_table.nested_field.array_column,List([custom_string])) * 
my_table.sign)], metrics=[output_rows=20, elapsed_compute=4.769958246s]
                ProjectionExec: expr=[sign@0 as sign, 
nested_field.array_column@3 as nested_field.array_column], 
metrics=[output_rows=6185527, elapsed_compute=396.238µs]
                    CoalesceBatchesExec: target_batch_size=8192, 
metrics=[output_rows=6185527, elapsed_compute=168.795215ms]
                        FilterExec: int_column@1 = 2 AND (CAST(string_column@2 
AS Utf8) !~* (.*.|)(word1|word2|word3|word4).*), metrics=[output_rows=6185527, 
elapsed_compute=652.683132ms]
                            ParquetExec: file_groups={20 groups: 
[[path/to/parquet/my_dataset.parquet:0..262363404], 
[path/to/parquet/my_dataset.parquet:262363404..524726808], 
[path/to/parquet/my_dataset.parquet:524726808..787090212], 
[path/to/parquet/my_dataset.parquet:787090212..1049453616], 
[path/to/parquet/my_dataset.parquet:1049453616..1311817020], ...]}, 
projection=[sign, int_column, string_column, nested_field.array_column], 
predicate=int_column@5 = 2 AND (CAST(string_column@79 AS Utf8) !~* 
(.*.|)(word1|word2|word3|word4).*), pruning_predicate=int_column_min@0 <= 2 AND 
2 <= int_column_max@1, metrics=[output_rows=6185527, elapsed_compute=20ns, 
page_index_rows_filtered=0, predicate_evaluation_errors=0, file_scan_errors=0, 
row_groups_pruned=13, num_predicate_creation_errors=0, file_open_errors=0, 
pushdown_rows_filtered=2880077, bytes_scanned=311061152, 
time_elapsed_processing=5.335386071s, pushdown_eval_time=781.00424ms, 
time_elapsed_scanning_until_data=1.472284126s, 
time_elapsed_opening=3.032638581s, time_elapsed_scanning_total=18.9360584s, 
page_index_eval_time=1.318µs]
```

We also noticed in some other queries that with many Parquet files, 
performance degraded far more relative to ClickHouse than it did with a single 
Parquet file.

So, is there anything we might have missed, perhaps something that is general 
knowledge, that could explain this performance gap?
Thanks for your help!



GitHub link: https://github.com/apache/datafusion/discussions/7737
