[ https://issues.apache.org/jira/browse/IMPALA-13966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944467#comment-17944467 ]
Michael Smith commented on IMPALA-13966:
----------------------------------------

[~csringhofer] [~stigahuang] [~joemcdonnell] I think you've all looked at similar problems to this. Any other or more specific ideas about how to approach it? Or tickets I've overlooked?

> Heavy scan concurrency on Parquet tables with large page size is slow
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-13966
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13966
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.5.0
>            Reporter: Michael Smith
>            Priority: Major
>
> When reading Parquet tables with a large average page size under heavy scan concurrency, we see performance slow down significantly.
>
> Impala writes Iceberg tables with its default page size of 64KB unless {{write.parquet.page-size-bytes}} is explicitly set. The Iceberg library itself defaults to 1MB, and other tools - such as Spark - may use that default when writing tables.
>
> I was able to distill an example that demonstrates a substantial difference in memory allocation performance for Parquet reads when using 1MB page sizes, a difference that is not present with 64KB pages.
> # Get a machine with at least 32 real cores (not hyperthreaded) and an SSD.
> # Create an Iceberg table with millions of rows containing a moderately long string (hundreds of characters) and a large page size; it's also helpful to create a version with the smaller page size. I used the following with and without {{write.parquet.page-size-bytes}} specified (the latter producing {{iceberg_small_page}}):
> {code:java}
> create table iceberg_large_page stored by iceberg
> tblproperties('write.parquet.page-size-bytes'='1048576')
> as select *, repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
> insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;{code}
> # Restart Impala with {{-num_io_threads_per_solid_state_disk=32}} to increase read parallelism. The SSD should be able to handle it. The goal is to have as many scanners as possible loading and decompressing data at the same time, ideally with concurrent memory allocation on every thread (a standalone approximation of that allocation pattern is sketched after these steps).
> # Run a query that doesn't process much data outside the scan and forces Impala to read every entry in the long string column:
> {code}
> select _c1 from (select _c1, l_shipdate from iceberg_small_page where _c1 like "%toad%"
> UNION ALL
> select _c1, l_shipdate from iceberg_small_page where _c1 like "%frog%") x
> ORDER BY l_shipdate LIMIT 10
> {code}
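> The allocation pattern these steps generate can be approximated outside Impala with a small standalone microbenchmark - a sketch only, not Impala code, with an illustrative thread count and iteration count. Running it with a 65536-byte and then a 1048576-byte buffer compares the two page sizes' allocation behavior under the same concurrency; linking with {{-ltcmalloc}} from gperftools exercises the same allocator Impala uses.
> {code:cpp}
> // Sketch only (not Impala code): N threads repeatedly allocate, touch, and
> // free buffers of a given size, mimicking concurrent data page allocation.
> // Build: g++ -O2 -pthread alloc_bench.cc -o alloc_bench [-ltcmalloc]
> #include <chrono>
> #include <cstddef>
> #include <cstdlib>
> #include <iostream>
> #include <thread>
> #include <vector>
>
> int main(int argc, char** argv) {
>   const int kThreads = 32;   // matches the 32-core repro above
>   const int kIters = 10000;  // allocations per thread; illustrative
>   const size_t buffer_size = argc > 1 ? atol(argv[1]) : 1 << 20;
>
>   const auto start = std::chrono::steady_clock::now();
>   std::vector<std::thread> threads;
>   for (int t = 0; t < kThreads; ++t) {
>     threads.emplace_back([&] {
>       for (int i = 0; i < kIters; ++i) {
>         char* buf = static_cast<char*>(malloc(buffer_size));
>         // Touch each OS page so the compiler can't elide the allocation.
>         for (size_t off = 0; off < buffer_size; off += 4096) buf[off] = 1;
>         free(buf);
>       }
>     });
>   }
>   for (auto& th : threads) th.join();
>   const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
>       std::chrono::steady_clock::now() - start).count();
>   std::cout << kThreads * kIters << " alloc/free of " << buffer_size
>             << " bytes: " << ms << " ms" << std::endl;
>   return 0;
> }
> {code}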
> I also added IMPALA-13487 to display ParquetDataPagePoolAllocDuration, to simplify identifying slow allocation performance. One query was sufficient to show a clear difference in performance, given enough scanner threads to fully utilize all DiskIoMgr threads. The small page query had entries like
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 20.075us ; Min: 0.000ns ; Max: 65.999ms ; Sum: 2s802ms ; Number of samples: 139620)
> ParquetUncompressedPageSize: (Avg: 65.72 KB (67296) ; Min: 1.37 KB (1406) ; Max: 87.44 KB (89539) ; Sum: 6.14 GB (6590048444) ; Number of samples: 97926)
> {code}
> while the large page query had
> {code}
> ParquetDataPagePoolAllocDuration: (Avg: 2.753ms ; Min: 0.000ns ; Max: 64.999ms ; Sum: 30s346ms ; Number of samples: 11022)
> ParquetUncompressedPageSize: (Avg: 901.89 KB (923535) ; Min: 360.00 B (360) ; Max: 1.00 MB (1048583) ; Sum: 6.14 GB (6597738570) ; Number of samples: 7144)
> {code}
> ParquetUncompressedPageSize confirms the difference in page sizes.
>
> Our theory is that the slow allocations reflect thread contention on the global pool in tcmalloc. TCMalloc maintains per-thread caches for small allocations - up to 256KB - but larger allocations go through a global pool. If that's right, some possible options that could help are:
> 1. Re-use buffers more across Parquet reads, so we don't need to allocate memory as frequently (sketched below).
> 2. Consider a different memory allocator for larger allocations.
>
> This likely only impacts very high-parallelism, read-heavy queries. If each buffer is used in more processing, the cost of allocation should become a smaller part of the query time.
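> For option 1, a minimal sketch of the shape buffer reuse could take - hypothetical; {{PageBufferCache}} and its methods are illustrative names, not Impala's existing pool API, and it assumes a single fixed buffer size. Each scanner thread keeps a small free list of page buffers and recycles them, so steady-state reads stop paying malloc/free for large chunks:
> {code:cpp}
> // Hypothetical sketch of option 1, not Impala's actual pool API: each
> // scanner thread caches a few page buffers and recycles them.
> #include <cstddef>
> #include <cstdlib>
> #include <vector>
>
> class PageBufferCache {
>  public:
>   explicit PageBufferCache(size_t buffer_size, size_t max_cached = 8)
>       : buffer_size_(buffer_size), max_cached_(max_cached) {}
>
>   ~PageBufferCache() {
>     for (void* buf : free_list_) free(buf);
>   }
>
>   // Hand back a cached buffer when available; hit malloc otherwise.
>   void* Allocate() {
>     if (!free_list_.empty()) {
>       void* buf = free_list_.back();
>       free_list_.pop_back();
>       return buf;
>     }
>     return malloc(buffer_size_);
>   }
>
>   // Recycle the buffer instead of freeing it, up to a fixed cap.
>   void Release(void* buf) {
>     if (free_list_.size() < max_cached_) {
>       free_list_.push_back(buf);
>     } else {
>       free(buf);
>     }
>   }
>
>  private:
>   const size_t buffer_size_;
>   const size_t max_cached_;
>   std::vector<void*> free_list_;
> };
>
> // One instance per scanner thread needs no locking; with 1MB pages each
> // thread then touches the global allocator only while warming its cache.
> thread_local PageBufferCache page_cache(1 << 20);
> {code}
> The counter name ParquetDataPagePoolAllocDuration suggests some pooling already exists; the sketch is only meant to show why widening reuse across reads would reduce trips to the global pool.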