[ https://issues.apache.org/jira/browse/IMPALA-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944467#comment-17944467 ]

Michael Smith commented on IMPALA-13966:
----------------------------------------

[~csringhofer] [~stigahuang] [~joemcdonnell] I think you've all looked at 
problems similar to this one. Do you have any other or more specific ideas 
about how to approach it, or tickets I've overlooked?

> Heavy scan concurrency on Parquet tables with large page size is slow
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-13966
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13966
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.5.0
>            Reporter: Michael Smith
>            Priority: Major
>
> When reading Parquet tables with a large average page size under heavy scan 
> concurrency, we see performance slow down significantly.
> Impala writes Iceberg tables with its default page size of 64KB unless 
> {{write.parquet.page-size-bytes}} is explicitly set. The Iceberg library 
> itself defaults to 1MB, and other tools, such as Spark, may use that 
> default when writing tables.
> I was able to distill an example that demonstrates a substantial difference 
> in memory allocation performance for Parquet reads with 1MB page sizes 
> that is not present with 64KB pages.
> # Get a machine with at least 32 real cores (not hyperthreaded) and an SSD.
> # Create an Iceberg table with millions of rows containing a moderately long 
> string column (hundreds of characters) and a large page size; it's also 
> helpful to create a version with the smaller page size for comparison. I used 
> the following statements, and created iceberg_small_page the same way but 
> without {{write.parquet.page-size-bytes}} specified
> {code:java}
> create table iceberg_large_page stored by iceberg 
> tblproperties('write.parquet.page-size-bytes'='1048576') as select *, 
> repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from 
> tpch.lineitem;{code}
> # Restart Impala with {{-num_io_threads_per_solid_state_disk=32}} to increase 
> read parallelism; the SSD should be able to handle it. The goal is to have as 
> many scanners as possible loading and decompressing data at the same time, 
> ideally with concurrent memory allocation on every thread.
> # Run a query that doesn't process much data outside the scan and forces 
> Impala to read every entry in the long string column
> {code}
> select _c1 from (select _c1, l_shipdate from iceberg_small_page where _c1 
> like "%toad%" UNION ALL select _c1, l_shipdate from iceberg_small_page where 
> _c1 like "%frog%") x ORDER BY l_shipdate LIMIT 10
> {code}
> I also added IMPALA-13487 to display ParquetDataPagePoolAllocDuration, to 
> simplify identifying slow allocation performance. One query was sufficient to 
> show a clear difference in performance once there were enough scanner threads 
> to fully utilize all DiskIoMgr threads. The small-page query had entries like
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 20.075us ; Min: 0.000ns ; Max: 
> 65.999ms ; Sum: 2s802ms ; Number of samples: 139620)
> ParquetUncompressedPageSize: (Avg: 65.72 KB (67296) ; Min: 1.37 KB (1406) ; 
> Max: 87.44 KB (89539) ; Sum: 6.14 GB (6590048444) ; Number of samples: 97926)
> {code}
> while the large page query had
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 2.753ms ; Min: 0.000ns ; Max: 
> 64.999ms ; Sum: 30s346ms ; Number of samples: 11022)
> ParquetUncompressedPageSize: (Avg: 901.89 KB (923535) ; Min: 360.00 B (360) ; 
> Max: 1.00 MB (1048583) ; Sum: 6.14 GB (6597738570) ; Number of samples: 7144)
> {code}
> ParquetUncompressedPageSize shows the difference in page sizes.
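> Putting the two profiles side by side: both runs read roughly the same 
> ~6.14 GB of uncompressed page data, but total time in the page pool allocator 
> grows from about 2.8s (139,620 timed allocations, ~20us each) to over 30s 
> (11,022 timed allocations, ~2.75ms each) - roughly 10x more total allocation 
> time and over 100x more time per allocation, despite far fewer, larger 
> allocations.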
> Our theory is that this reflects thread contention when accessing the global 
> pool in tcmalloc: tcmalloc maintains per-thread caches for small allocations 
> (up to 256KB), but larger allocations go through a shared global pool. If 
> that's right, some options that could help are
> 1. Re-use buffers more across Parquet page reads, so we don't need to 
> allocate memory as frequently (see the sketch below for the general idea).
> 2. Consider a different memory allocator for larger allocations.
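> A minimal sketch of the general idea in option 1, purely illustrative and not 
> Impala code (the class and method names here are hypothetical): keep a small 
> cache of page-sized buffers per scanner so repeated large page reads don't go 
> back to malloc/free, and thus potentially tcmalloc's global pool, on every 
> page.
> {code}
> // Illustrative only: recycle page-sized buffers instead of allocating and
> // freeing one per page.
> #include <cstddef>
> #include <cstdint>
> #include <memory>
> #include <vector>
> 
> class PageBufferCache {
>  public:
>   explicit PageBufferCache(std::size_t buffer_size, std::size_t max_cached = 8)
>       : buffer_size_(buffer_size), max_cached_(max_cached) {}
> 
>   // Returns a buffer of buffer_size_ bytes, reusing a cached one if possible.
>   std::unique_ptr<std::uint8_t[]> Acquire() {
>     if (!cache_.empty()) {
>       std::unique_ptr<std::uint8_t[]> buf = std::move(cache_.back());
>       cache_.pop_back();
>       return buf;
>     }
>     return std::unique_ptr<std::uint8_t[]>(new std::uint8_t[buffer_size_]);
>   }
> 
>   // Hands a buffer back; keeps at most max_cached_ buffers alive.
>   void Release(std::unique_ptr<std::uint8_t[]> buf) {
>     if (cache_.size() < max_cached_) cache_.push_back(std::move(buf));
>     // Otherwise the buffer is freed normally when buf goes out of scope.
>   }
> 
>  private:
>   std::size_t buffer_size_;
>   std::size_t max_cached_;
>   std::vector<std::unique_ptr<std::uint8_t[]>> cache_;
> };
> {code}
> Kept per scanner thread, this needs no locking; it only shows the shape of 
> the idea, not how it would hook into the existing data page pool.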
> This likely only impacts very high-parallelism, read-heavy queries. If each 
> buffer is used for more processing, the cost of allocation becomes a smaller 
> fraction of the query time.
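> As a rough way to probe the contention theory outside Impala, a standalone 
> microbenchmark along these lines (assumptions: built with g++ -pthread and 
> optionally linked against tcmalloc via -ltcmalloc; not Impala code) can 
> compare many threads doing 64KB vs 1MB allocations while keeping total bytes 
> equal:
> {code}
> // Rough standalone sketch: time concurrent malloc/free of 64KB vs 1MB blocks
> // across 32 threads. The hypothesis is that blocks above the per-thread
> // cache limit (~256KB) go through a shared pool and scale worse.
> // Build e.g.: g++ -O2 -pthread alloc_bench.cc -o alloc_bench -ltcmalloc
> #include <chrono>
> #include <cstddef>
> #include <cstdlib>
> #include <cstring>
> #include <iostream>
> #include <thread>
> #include <vector>
> 
> static void Worker(std::size_t alloc_size, int iters) {
>   for (int i = 0; i < iters; ++i) {
>     void* p = std::malloc(alloc_size);
>     std::memset(p, 0, alloc_size);  // Touch the memory so pages are backed.
>     std::free(p);
>   }
> }
> 
> static double RunCase(std::size_t alloc_size, int num_threads, int iters) {
>   auto start = std::chrono::steady_clock::now();
>   std::vector<std::thread> threads;
>   for (int t = 0; t < num_threads; ++t) {
>     threads.emplace_back(Worker, alloc_size, iters);
>   }
>   for (auto& t : threads) t.join();
>   std::chrono::duration<double> elapsed =
>       std::chrono::steady_clock::now() - start;
>   return elapsed.count();
> }
> 
> int main() {
>   const int kThreads = 32;  // Matches the IO thread count in the repro.
>   // Same total bytes per thread in both cases, so the comparison isolates
>   // allocation behavior rather than memset volume.
>   std::cout << "64KB x 16000 iters: " << RunCase(64 * 1024, kThreads, 16000)
>             << "s\n";
>   std::cout << "1MB  x 1000  iters: " << RunCase(1024 * 1024, kThreads, 1000)
>             << "s\n";
>   return 0;
> }
> {code}
> If the 1MB case degrades disproportionately as the thread count grows, that 
> would support the theory that the slowdown comes from the allocator rather 
> than from decompression or IO.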


