[Impala-ASF-CR] Generate parallel data load with batch files

Michael Smith (Code Review) Tue, 04 Nov 2025 16:18:14 -0800

Michael Smith has uploaded a new patch set (#5).
( http://gerrit.cloudera.org:8080/23628 )


Change subject: Generate parallel data load with batch files
......................................................................

Generate parallel data load with batch files

Creates num_processes files for each phase of schema SQL dataload to
execute in parallel.

Analyzes SQL statements to create a dependency graph using networkx, and
batches statements by independent subgraphs so dependent statements are
always executed sequentially, and independent statements may be executed
concurrently.
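
As a rough illustration of the batching idea above (not the code in this
patch), the sketch below groups statements that touch a common table into
one batch and then spreads the batches across num_processes lists, one per
output file. Several things here are assumptions made for the example: the
regex-based tables_in() helper is a hypothetical stand-in for real SQL
analysis, a small union-find replaces the networkx graph the patch uses so
the sketch needs only the standard library, and the "largest batch to the
least-loaded file" assignment is one plausible policy rather than the one
the patch necessarily implements.

```python
import re

# Hypothetical, heavily simplified table-name extraction (an assumption for
# this sketch); the real patch analyzes statements in its own way.
TABLE_RE = re.compile(
    r"\b(?:TABLE|INTO|FROM|JOIN)\s+(?:IF\s+(?:NOT\s+)?EXISTS\s+)?([\w.]+)",
    re.IGNORECASE,
)


def tables_in(stmt):
    """Return the lower-cased table names a statement appears to reference."""
    return {name.lower() for name in TABLE_RE.findall(stmt)}


def batch_statements(statements):
    """Group statements into independent batches, preserving original order.

    Statements sharing any table (directly or transitively) land in the same
    batch, which is equivalent to taking connected components of the
    dependency graph.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    stmt_tables = [sorted(tables_in(s)) for s in statements]
    for tables in stmt_tables:
        for table in tables:
            find(table)
        for other in tables[1:]:
            union(tables[0], other)

    batches = {}
    for index, (stmt, tables) in enumerate(zip(statements, stmt_tables)):
        # Statements with no recognized table get a batch of their own.
        key = find(tables[0]) if tables else ("stmt", index)
        batches.setdefault(key, []).append(stmt)
    return list(batches.values())


def assign_to_files(batches, num_processes):
    """Greedily place each batch, largest first, in the least-loaded file."""
    files = [[] for _ in range(num_processes)]
    for batch in sorted(batches, key=len, reverse=True):
        min(files, key=len).extend(batch)
    return files


statements = [
    "CREATE TABLE a (id INT)",
    "CREATE TABLE b (id INT)",
    "INSERT INTO a SELECT id FROM src_a",
    "INSERT INTO b SELECT id FROM src_b",
    "CREATE TABLE c AS SELECT * FROM a",
]
batches = batch_statements(statements)
files = assign_to_files(batches, num_processes=2)
```

Because every batch is kept whole and in source order inside a single
file, statements that depend on each other still run sequentially, while
separate files can be executed concurrently.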

This may help significantly, but Hive compaction performance is a large
wildcard in functional-query dataload time. I've seen runs take either
3m44s or 8m48s, which is a significant improvement or a slight
regression compared to the baseline of 8m1s. Needs investigation.

Change-Id: I9586504f6cb91f873f7ed978fda3df32e759ba90
---
M bin/load-data.py
M infra/python/deps/py3-requirements.txt
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
4 files changed, 108 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/28/23628/5
--
To view, visit http://gerrit.cloudera.org:8080/23628
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9586504f6cb91f873f7ed978fda3df32e759ba90
Gerrit-Change-Number: 23628
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Smith <[email protected]>
