Hello Riza Suminto, Xuebin Su, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/22062

to look at the new patch set (#17).

Change subject: IMPALA-13487: Add profile counters for memory allocation
......................................................................

IMPALA-13487: Add profile counters for memory allocation

This patch adds profile counters for memory allocation and free in
MemPools, which are useful to detect tcmalloc contention. The following
counters are added:
- Thread-level page faults: TotalThreadsMinorPageFaults,
  TotalThreadsMajorPageFaults.
- MemPool counters for tuple_mem_pool and aux_mem_pool of the scratch
  batch in columnar scanners:
  - ScratchBatchMemAllocDuration
  - ScratchBatchMemFreeDuration
  - ScratchBatchMemAllocBytes
- MemPool counters for data_page_pool of ParquetColumnChunkReader:
  - ParquetDataPagePoolAllocBytes
  - ParquetDataPagePoolAllocDuration
  - ParquetDataPagePoolFreeBytes
  - ParquetDataPagePoolFreeDuration
- MemPool counters for releasing memory in RowBatch::Reset() called in
  FragmentInstanceState::Exec:
  - RowBatchMemPoolFreeDuration
  - RowBatchMemPoolFreeBytes
- Duration in HdfsColumnarScanner::GetCollectionMemory(), which includes
  memory allocation for collection values and memcpy when doubling the
  tuple buffer:
  - MaterializeCollectionGetMemTime

Here is an example of a memory-bound query:

Fragment Instance:
  - RowBatchMemPoolFreeBytes: (Avg: 719.25 KB (736517) ; Min: 4.00 KB (4096) ; Max: 4.12 MB (4321922) ; Sum: 1.93 GB (2069615013) ; Number of samples: 2810)
  - RowBatchMemPoolFreeDuration: (Avg: 132.027us ; Min: 0.000ns ; Max: 21.999ms ; Sum: 370.997ms ; Number of samples: 2810)
  - TotalStorageWaitTime: 47.999ms
  - TotalThreadsInvoluntaryContextSwitches: 2 (2)
  - TotalThreadsMajorPageFaults: 0 (0)
  - TotalThreadsMinorPageFaults: 549.63K (549626)
  - TotalThreadsTotalWallClockTime: 9s646ms
  - TotalThreadsSysTime: 1s508ms
  - TotalThreadsUserTime: 1s791ms
  - TotalThreadsVoluntaryContextSwitches: 8.85K (8852)
  - TotalTime: 9s648ms
  ...
  HDFS_SCAN_NODE (id=0):
    - ParquetDataPagePoolAllocBytes: (Avg: 2.36 MB (2477480) ; Min: 4.00 KB (4096) ; Max: 4.12 MB (4321922) ; Sum: 1.02 GB (1090091508) ; Number of samples: 440)
    - ParquetDataPagePoolAllocDuration: (Avg: 1.263ms ; Min: 0.000ns ; Max: 39.999ms ; Sum: 555.995ms ; Number of samples: 440)
    - ParquetDataPagePoolFreeBytes: (Avg: 1.28 MB (1344350) ; Min: 4.00 KB (4096) ; Max: 1.53 MB (1601012) ; Sum: 282.06 MB (295757000) ; Number of samples: 220)
    - ParquetDataPagePoolFreeDuration: (Avg: 1.927ms ; Min: 0.000ns ; Max: 19.999ms ; Sum: 423.996ms ; Number of samples: 220)
    - ScratchBatchMemAllocBytes: (Avg: 486.33 KB (498004) ; Min: 4.00 KB (4096) ; Max: 512.00 KB (524288) ; Sum: 1.19 GB (1274890240) ; Number of samples: 2560)
    - ScratchBatchMemAllocDuration: (Avg: 1.936ms ; Min: 0.000ns ; Max: 35.999ms ; Sum: 4s956ms ; Number of samples: 2560)
    - ScratchBatchMemFreeDuration: 0.000ns (Number of samples: 0)
    - DecompressionTime: 1s396ms
    - MaterializeCollectionGetMemTime: 4s899ms
    - MaterializeTupleTime: 6s656ms
    - ScannerIoWaitTime: 47.999ms
    - TotalRawHdfsOpenFileTime: 0.000ns
    - TotalRawHdfsReadTime: 360.997ms
    - TotalTime: 9s254ms

The fragment instance took 9s648ms to finish, of which 370.997ms was
spent releasing memory of the final RowBatch. The majority of the time
is spent in the scan node (9s254ms), mostly in DecompressionTime +
MaterializeTupleTime + ScannerIoWaitTime + TotalRawHdfsReadTime, and the
largest of these is MaterializeTupleTime (6s656ms).
ScratchBatchMemAllocDuration shows that invoking std::malloc() while
materializing the scratch batches took 4s956ms overall.
MaterializeCollectionGetMemTime shows that allocating memory for
collections and copying memory when doubling the tuple buffer took
4s899ms, so materializing the collections took most of the time. Note
that DecompressionTime (1s396ms) also includes the memory allocation and
free durations tracked by ParquetDataPagePoolAllocDuration (555.995ms)
and ParquetDataPagePoolFreeDuration (423.996ms), so the majority of
DecompressionTime is also spent in memory operations.

The other observation is that TotalThreadsTotalWallClockTime is much
higher than TotalThreadsSysTime + TotalThreadsUserTime and there is a
large number of TotalThreadsVoluntaryContextSwitches, so the thread is
waiting on resources (e.g. locks) for long durations. In the above case,
it is waiting on locks in tcmalloc memory allocation (an off-CPU flame
graph is needed to reveal this).

Implementation of MemPool counters

Add MemPoolCounters in MemPool to track malloc/free duration and bytes.
Note that the counters are not updated in the destructor, since all
chunks are expected to be freed or transferred before the destructor is
called.
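For illustration, a minimal, self-contained sketch of the counter idea
follows: time each malloc/free call and accumulate durations and bytes
per pool. This is not the actual MemPool/RuntimeProfile code in the
patch; the names PoolCounters, TimedAlloc and TimedFree are hypothetical.

  // Illustrative sketch only; not the Impala implementation.
  #include <chrono>
  #include <cstdint>
  #include <cstdio>
  #include <cstdlib>

  // Hypothetical per-pool counters, analogous in spirit to the MemPool
  // counters above (allocation/free duration and bytes).
  struct PoolCounters {
    int64_t alloc_duration_ns = 0;
    int64_t free_duration_ns = 0;
    int64_t alloc_bytes = 0;
    int64_t free_bytes = 0;
  };

  // Time one malloc() call and accumulate its duration and size.
  void* TimedAlloc(std::size_t size, PoolCounters* c) {
    auto start = std::chrono::steady_clock::now();
    void* ptr = std::malloc(size);
    auto end = std::chrono::steady_clock::now();
    c->alloc_duration_ns +=
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    c->alloc_bytes += static_cast<int64_t>(size);
    return ptr;
  }

  // Time one free() call and accumulate its duration and the released size.
  void TimedFree(void* ptr, std::size_t size, PoolCounters* c) {
    auto start = std::chrono::steady_clock::now();
    std::free(ptr);
    auto end = std::chrono::steady_clock::now();
    c->free_duration_ns +=
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    c->free_bytes += static_cast<int64_t>(size);
  }

  int main() {
    PoolCounters counters;
    void* p = TimedAlloc(4096, &counters);
    TimedFree(p, 4096, &counters);
    std::printf("alloc: %lld bytes in %lld ns, free: %lld bytes in %lld ns\n",
                static_cast<long long>(counters.alloc_bytes),
                static_cast<long long>(counters.alloc_duration_ns),
                static_cast<long long>(counters.free_bytes),
                static_cast<long long>(counters.free_duration_ns));
    return 0;
  }

Totals accumulated this way are the kind of values that the per-pool
profile counters listed above expose.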
MemPool is widely used in the code base. This patch only exposes MemPool
counters in three places:
- the scratch batch in columnar scanners
- the ParquetColumnChunkReader of Parquet scanners
- the final RowBatch reset by FragmentInstanceState

This patch also moves GetCollectionMemory() from HdfsScanner to
HdfsColumnarScanner since it's only used by the Parquet and ORC scanners.

PrettyPrint of SummaryStatsCounter is updated to also show the sum of
the values.

Tests:
- Tested by manually reproducing the memory-bound queries.
- Ran a perf A/B test on TPC-H (sf=42) and didn't see a significant
  performance change.
- Added e2e tests.

Change-Id: I982315d96e6de20a3616f3bd2a2b4866d1ff4710
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/scratch-tuple-batch.h
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/mem-pool.cc
M be/src/runtime/mem-pool.h
M be/src/runtime/row-batch.h
M be/src/util/impala-profile-tool.cc
M be/src/util/runtime-profile-counters.h
M be/src/util/runtime-profile.cc
M tests/query_test/test_observability.py
17 files changed, 241 insertions(+), 34 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/22062/17

--
To view, visit http://gerrit.cloudera.org:8080/22062
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I982315d96e6de20a3616f3bd2a2b4866d1ff4710
Gerrit-Change-Number: 22062
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Xuebin Su <x...@cloudera.com>