Lordworms opened a new issue, #13298: URL: https://github.com/apache/datafusion/issues/13298
### Describe the bug I was doing https://github.com/apache/datafusion/pull/13054 and after rebase main I find out that the current parquet predicate pushdown may have some problem, it is 7 times slower than without a predicate <img width="1115" alt="image" src="https://github.com/user-attachments/assets/1b86cc71-c8f3-4061-8aae-260e33837014"> ### To Reproduce the script to generate the parquet file ```Python import pandas as pd import numpy as np part = pd.DataFrame({ 'p_partkey': np.random.randint(10, 21, size=1000), 'p_brand': np.random.choice(['Brand#1', 'Brand#2', 'Brand#3'], 1000), 'p_container': np.random.choice(['SM BOX', 'LG BOX', 'MED BOX'], 1000) }) lineitem = pd.DataFrame({ 'l_partkey': np.random.randint(1, 1000001, size=100000000), 'l_quantity': np.random.uniform(1, 50, size=100000000), 'l_extendedprice': np.random.uniform(100, 10000, size=100000000) }) part.to_parquet('/Users/yxiang1/Personal/code/datafusion/.vscode/part.parquet', index=False) lineitem.to_parquet('/Users/yxiang1/Personal/code/datafusion/.vscode/lineitem.parquet', index=False) ``` The results ``` > CREATE EXTERNAL TABLE lineitem ( l_partkey BIGINT, l_quantity DOUBLE, l_extendedprice DOUBLE ) STORED AS PARQUET LOCATION '/Users/yxiang1/Personal/code/datafusion/.vscode/lineitem.parquet'; 0 row(s) fetched. Elapsed 0.016 seconds. > set datafusion.execution.target_partitions = 2; 0 row(s) fetched. Elapsed 0.002 seconds. > select count(*) from linitem where l_partkey >= 10 and l_partkey <= 20; Error during planning: table 'datafusion.public.linitem' not found > select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20; +----------+ | count(*) | +----------+ | 1093 | +----------+ 1 row(s) fetched. Elapsed 7.037 seconds. > select count(*) from lineitem; +-----------+ | count(*) | +-----------+ | 100000000 | +-----------+ 1 row(s) fetched. Elapsed 1.022 seconds. > set datafusion.execution.parquet.pushdown_filters = true; 0 row(s) fetched. Elapsed 0.002 seconds. > select count(*) from lineitem; +-----------+ | count(*) | +-----------+ | 100000000 | +-----------+ 1 row(s) fetched. Elapsed 1.029 seconds. > select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20; +----------+ | count(*) | +----------+ | 1093 | +----------+ 1 row(s) fetched. Elapsed 6.748 seconds. ``` ### Expected behavior _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
