Michael Smith created IMPALA-13966:
--------------------------------------

             Summary: Heavy scan concurrency on Parquet tables with large page size is slow
                 Key: IMPALA-13966
                 URL: https://issues.apache.org/jira/browse/IMPALA-13966
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
    Affects Versions: Impala 4.5.0
            Reporter: Michael Smith


When reading Parquet tables with a large average page size under heavy scan
concurrency, we see performance slow down significantly.

Impala writes Iceberg tables with its default page size of 64KB, unless 
{{write.parquet.page-size-bytes}} is explicitly set. The Iceberg library itself 
defaults to 1MB, and other tools - such as Spark - may use that default when 
writing tables.

I was able to distill an example that demonstrates a substantial difference in
memory allocation performance for Parquet reads when using 1MB page sizes, a
difference that is not present with 64KB pages.
# Get a machine with at least 32 real cores (not hyperthreaded) and an SSD.
# Create an Iceberg table with millions of rows containing a moderately long
string (hundreds of characters) and a large page size; it's also helpful to
create a version with the smaller default page size. I used the following,
with {{write.parquet.page-size-bytes}} specified ({{iceberg_large_page}}) and
without it ({{iceberg_small_page}}):
{code:sql}
create table iceberg_large_page stored by iceberg
tblproperties('write.parquet.page-size-bytes'='1048576')
as select *, repeat(l_comment, 10) from tpch.lineitem;
insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;
insert into iceberg_large_page select *, repeat(l_comment, 10) from tpch.lineitem;{code}
# Restart Impala with {{-num_io_threads_per_solid_state_disk=32}} to increase
read parallelism; the SSD should be able to handle it. The goal is to have as
many scanners as possible attempting to load and decompress data at the same
time, ideally with concurrent memory allocation on every thread.
# Run a query that doesn't process much data outside the scan and forces
Impala to read every entry in the long string column; run it against each table:
{code:sql}
select _c1 from (
  select _c1, l_shipdate from iceberg_small_page where _c1 like "%toad%"
  UNION ALL
  select _c1, l_shipdate from iceberg_small_page where _c1 like "%frog%"
) x ORDER BY l_shipdate LIMIT 10
{code}

I also added IMPALA-13487 to display {{ParquetDataPagePoolAllocDuration}},
which simplifies identifying slow allocation performance. A single query was
sufficient to show the difference in performance, as long as there are enough
scanner threads to fully utilize all DiskIoMgr threads. The small page query
had entries like
{code}
ParquetDataPagePoolAllocDuration: (Avg: 20.075us ; Min: 0.000ns ; Max: 65.999ms 
; Sum: 2s802ms ; Number of samples: 139620)
ParquetUncompressedPageSize: (Avg: 65.72 KB (67296) ; Min: 1.37 KB (1406) ; 
Max: 87.44 KB (89539) ; Sum: 6.14 GB (6590048444) ; Number of samples: 97926)
{code}
while the large page query had
{code}
ParquetDataPagePoolAllocDuration: (Avg: 2.753ms ; Min: 0.000ns ; Max: 64.999ms 
; Sum: 30s346ms ; Number of samples: 11022)
ParquetUncompressedPageSize: (Avg: 901.89 KB (923535) ; Min: 360.00 B (360) ; 
Max: 1.00 MB (1048583) ; Sum: 6.14 GB (6597738570) ; Number of samples: 7144)
{code}
{{ParquetUncompressedPageSize}} confirms the difference in page sizes, and the
average allocation time goes from roughly 20us to 2.75ms - over 100x slower -
with the larger pages.

Our theory is that this represents thread contention on the global pool in
tcmalloc: TCMalloc maintains per-thread caches for small allocations - up to
256KB - but larger allocations go through a shared global pool, where
concurrent threads can serialize on a lock. The standalone sketch below
attempts to reproduce that contention in isolation. If that's right, some
possible options that could help are
# Re-use buffers more across Parquet reads, so we don't need to allocate
memory as frequently (sketched after this list).
# Consider a different memory allocator for larger allocations.
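
As a way to check the theory outside Impala, here's a minimal standalone C++
sketch (not Impala code; thread and iteration counts are arbitrary) that
measures malloc/free latency for 64KB vs 1MB buffers across 32 concurrent
threads. Link it against the same tcmalloc Impala uses (e.g. {{-ltcmalloc}})
to exercise the allocator in question.
{code:cpp}
// Standalone sketch, not Impala code: measure average allocation latency for
// 64KB vs 1MB buffers when many threads allocate and free concurrently.
// Compile with e.g.: g++ -O2 -std=c++17 alloc_bench.cc -pthread -ltcmalloc
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Allocate, touch, and free `iters` buffers of `size` bytes; return elapsed ns.
long long AllocLoop(size_t size, int iters) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    char* buf = static_cast<char*>(malloc(size));
    // Touch one byte per 4KB page so the allocation isn't optimized away.
    for (size_t off = 0; off < size; off += 4096) buf[off] = 1;
    free(buf);
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

void RunTest(const char* label, size_t size, int threads, int iters) {
  std::vector<std::thread> workers;
  std::vector<long long> elapsed_ns(threads, 0);
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&, t]() { elapsed_ns[t] = AllocLoop(size, iters); });
  }
  for (auto& w : workers) w.join();
  long long total_ns = 0;
  for (long long ns : elapsed_ns) total_ns += ns;
  printf("%s: %.2f us per allocation on average\n", label,
         total_ns / 1000.0 / (static_cast<double>(threads) * iters));
}

int main() {
  const int kThreads = 32;  // mirror the 32 DiskIoMgr threads in the repro
  const int kIters = 20000;
  RunTest("64KB buffers", 64 * 1024, kThreads, kIters);
  RunTest("1MB buffers", 1024 * 1024, kThreads, kIters);
  return 0;
}
{code}
If the theory holds, the 1MB case should show per-allocation latency that is
orders of magnitude worse than the 64KB case, mirroring the
{{ParquetDataPagePoolAllocDuration}} gap above.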
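
And a rough sketch of what option 1 could look like; the class and usage here
are hypothetical, not Impala's actual interfaces - the point is just that a
scanner keeps its largest decompression buffer alive across pages instead of
allocating and freeing one per page.
{code:cpp}
// Hypothetical sketch of option 1 (names are not Impala's actual classes):
// keep one scratch buffer per scanner and grow it only when a larger page
// arrives, so page decompression stops paying a malloc/free per page.
#include <cstddef>
#include <cstdlib>

class ReusableBuffer {
 public:
  ~ReusableBuffer() { free(data_); }

  // Returns a buffer of at least `size` bytes, reallocating only on growth.
  char* GetBuffer(size_t size) {
    if (size > capacity_) {
      free(data_);
      data_ = static_cast<char*>(malloc(size));
      capacity_ = (data_ != nullptr) ? size : 0;
    }
    return data_;
  }

 private:
  char* data_ = nullptr;
  size_t capacity_ = 0;
};

// Usage sketch in a per-scanner page decode loop (pseudocode):
//   ReusableBuffer scratch;  // lives for the whole column chunk / scan range
//   for each data page:
//     char* out = scratch.GetBuffer(uncompressed_page_size);
//     Decompress(compressed_page, out);  // no allocator round trip per page
{code}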

This likely only impacts read-heavy queries with very high parallelism. If
each buffer is used for more processing, the cost of allocation should become
a smaller part of the query time.


