Hello Riza Suminto, Kurt Deschler, Xuebin Su, Csaba Ringhofer, Michael Smith, 
Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/22062

to look at the new patch set (#20).

Change subject: IMPALA-13487: Add profile counters for memory allocation
......................................................................

IMPALA-13487: Add profile counters for memory allocation

This patch adds profile counters for memory allocation and deallocation
in MemPools, which are useful for detecting tcmalloc contention.

The following counters are added:
 - Thread-level page faults: TotalThreadsMinorPageFaults,
   TotalThreadsMajorPageFaults.
 - MemPool counters for tuple_mem_pool and aux_mem_pool of the scratch
   batch in columnar scanners:
    - ScratchBatchMemAllocDuration
    - ScratchBatchMemFreeDuration
    - ScratchBatchMemAllocBytes
 - MemPool counters for data_page_pool of ParquetColumnChunkReader
    - ParquetDataPagePoolAllocBytes
    - ParquetDataPagePoolAllocDuration
    - ParquetDataPagePoolFreeBytes
    - ParquetDataPagePoolFreeDuration
 - MemPool counters for releasing memory in RowBatch::Reset() called in
   FragmentInstanceState::Exec:
    - RowBatchMemPoolFreeDuration
    - RowBatchMemPoolFreeBytes
 - Time spent in HdfsColumnarScanner::GetCollectionMemory(), which includes
   memory allocation for collection values and memcpy when doubling the
   tuple buffer:
    - MaterializeCollectionGetMemTime

Here is an example of a memory-bound query:
  Fragment Instance
    - RowBatchMemPoolFreeBytes: (Avg: 719.25 KB (736517) ; Min: 4.00 KB (4096) ; Max: 4.12 MB (4321922) ; Sum: 1.93 GB (2069615013) ; Number of samples: 2810)
    - RowBatchMemPoolFreeDuration: (Avg: 132.027us ; Min: 0.000ns ; Max: 21.999ms ; Sum: 370.997ms ; Number of samples: 2810)
    - TotalStorageWaitTime: 47.999ms
    - TotalThreadsInvoluntaryContextSwitches: 2 (2)
    - TotalThreadsMajorPageFaults: 0 (0)
    - TotalThreadsMinorPageFaults: 549.63K (549626)
    - TotalThreadsTotalWallClockTime: 9s646ms
      - TotalThreadsSysTime: 1s508ms
      - TotalThreadsUserTime: 1s791ms
    - TotalThreadsVoluntaryContextSwitches: 8.85K (8852)
    - TotalTime: 9s648ms
    ...
    HDFS_SCAN_NODE (id=0):
      - ParquetDataPagePoolAllocBytes: (Avg: 2.36 MB (2477480) ; Min: 4.00 KB (4096) ; Max: 4.12 MB (4321922) ; Sum: 1.02 GB (1090091508) ; Number of samples: 440)
      - ParquetDataPagePoolAllocDuration: (Avg: 1.263ms ; Min: 0.000ns ; Max: 39.999ms ; Sum: 555.995ms ; Number of samples: 440)
      - ParquetDataPagePoolFreeBytes: (Avg: 1.28 MB (1344350) ; Min: 4.00 KB (4096) ; Max: 1.53 MB (1601012) ; Sum: 282.06 MB (295757000) ; Number of samples: 220)
      - ParquetDataPagePoolFreeDuration: (Avg: 1.927ms ; Min: 0.000ns ; Max: 19.999ms ; Sum: 423.996ms ; Number of samples: 220)
      - ScratchBatchMemAllocBytes: (Avg: 486.33 KB (498004) ; Min: 4.00 KB (4096) ; Max: 512.00 KB (524288) ; Sum: 1.19 GB (1274890240) ; Number of samples: 2560)
      - ScratchBatchMemAllocDuration: (Avg: 1.936ms ; Min: 0.000ns ; Max: 35.999ms ; Sum: 4s956ms ; Number of samples: 2560)
      - ScratchBatchMemFreeDuration: 0.000ns (Number of samples: 0)
      - DecompressionTime: 1s396ms
      - MaterializeCollectionGetMemTime: 4s899ms
      - MaterializeTupleTime: 6s656ms
      - ScannerIoWaitTime: 47.999ms
      - TotalRawHdfsOpenFileTime: 0.000ns
      - TotalRawHdfsReadTime: 360.997ms
      - TotalTime: 9s254ms

The fragment instance took 9s648ms to finish, of which 370.997ms was
spent releasing memory of the final RowBatch. The majority of the time
was spent in the scan node (9s254ms), which mostly breaks down into
DecompressionTime + MaterializeTupleTime + ScannerIoWaitTime +
TotalRawHdfsReadTime, with MaterializeTupleTime (6s656ms) dominating.

ScratchBatchMemAllocDuration shows that invoking std::malloc() while
materializing the scratch batches took 4s956ms overall.
MaterializeCollectionGetMemTime shows that allocating memory for
collection values and copying memory when doubling the tuple buffer
took 4s899ms. So materializing the collections accounts for most of
that time.

Note that DecompressionTime (1s396ms) also includes the memory
allocation time tracked by the sum of ParquetDataPagePoolAllocDuration
(555.995ms), leaving roughly 840ms for the actual decompression. So
memory allocation also takes a significant portion of the time here.

The other observation is that TotalThreadsTotalWallClockTime (9s646ms)
is much higher than TotalThreadsSysTime + TotalThreadsUserTime
(1s508ms + 1s791ms = 3s299ms), and there is a large number of
TotalThreadsVoluntaryContextSwitches (8.85K). So the threads spend a
long time off-CPU, waiting for resources (e.g. locks). In the above
case, they are waiting for locks inside tcmalloc during memory
allocation (an off-CPU flame graph is needed to reveal this).
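
For reference, one way to capture such an off-CPU flame graph, assuming
bcc-tools and the FlameGraph scripts are available (tool names and
paths vary by distro):

  # Collect 30s of folded off-CPU stacks from the impalad process.
  offcputime -df -p <impalad-pid> 30 > offcpu.stacks
  # Render the folded stacks as an SVG flame graph.
  flamegraph.pl --colors=io offcpu.stacks > offcpu.svg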

Implementation of MemPool counters
Add MemPoolCounters to MemPool to track malloc/free durations and
bytes. Note that the counters are not updated in the destructor, since
all chunks are expected to be freed or transferred before the
destructor runs.
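
For reviewers skimming without the diff, here is a minimal sketch of
the shape of this mechanism. The names and code below are illustrative,
not the exact contents of mem-pool.h:

  struct MemPoolCounters {
    RuntimeProfile::SummaryStatsCounter* alloc_duration = nullptr;
    RuntimeProfile::SummaryStatsCounter* alloc_bytes = nullptr;
    RuntimeProfile::SummaryStatsCounter* free_duration = nullptr;
    RuntimeProfile::SummaryStatsCounter* free_bytes = nullptr;
  };

  // In the MemPool slow path that actually calls into malloc:
  MonotonicStopWatch sw;
  if (counters_.alloc_duration != nullptr) sw.Start();
  uint8_t* buf = reinterpret_cast<uint8_t*>(malloc(chunk_size));
  if (counters_.alloc_duration != nullptr) {
    // ElapsedTime() returns nanoseconds; both counters are summary
    // stats, so Avg/Min/Max/Sum show up in the profile.
    counters_.alloc_duration->UpdateCounter(sw.ElapsedTime());
    counters_.alloc_bytes->UpdateCounter(chunk_size);
  }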

MemPool is widely used in the code base. This patch exposes MemPool
counters in only three places (one of them is sketched below):
 - the scratch batch in columnar scanners
 - the ParquetColumnChunkReader of parquet scanners
 - the final RowBatch reset by FragmentInstanceState
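
For illustration, wiring the counters up in the first of these places
could look roughly like this; SetMemPoolCounters() is a hypothetical
setter, not necessarily the patch's actual API:

  MemPoolCounters counters;
  counters.alloc_duration = ADD_SUMMARY_STATS_TIMER(
      scan_node_->runtime_profile(), "ScratchBatchMemAllocDuration");
  // ... likewise for the other duration/bytes counters ...
  scratch_batch_->tuple_mem_pool.SetMemPoolCounters(counters);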

This patch also moves GetCollectionMemory() from HdfsScanner to
HdfsColumnarScanner since it's only used by the Parquet and ORC
scanners.

PrettyPrint of SummaryStatsCounter is updated to also show the sum of
the values when the unit is not a speed or a percentage.
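Conceptually something like the following (illustrative, not the exact
code in runtime-profile-counters.h):

  // Sums are only meaningful for additive units, not for rates or
  // percentages.
  bool show_sum = unit != TUnit::BYTES_PER_SECOND
      && unit != TUnit::UNIT_PER_SECOND && unit != TUnit::BASIS_POINTS;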

Tests
 - manually tested by reproducing the memory-bound queries
 - ran a perf A/B test on tpch (sf=42) and didn't see a significant
   performance change
 - added e2e tests

Change-Id: I982315d96e6de20a3616f3bd2a2b4866d1ff4710
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/scratch-tuple-batch.h
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/mem-pool.cc
M be/src/runtime/mem-pool.h
M be/src/runtime/row-batch.h
M be/src/util/impala-profile-tool.cc
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile.cc
M tests/query_test/test_observability.py
17 files changed, 257 insertions(+), 36 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/22062/20
--
To view, visit http://gerrit.cloudera.org:8080/22062
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I982315d96e6de20a3616f3bd2a2b4866d1ff4710
Gerrit-Change-Number: 22062
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Xuebin Su <x...@cloudera.com>
