Michael Smith has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/23628 )
Change subject: Generate parallel data load with batch files
......................................................................

Generate parallel data load with batch files

Creates num_processes files for each phase of schema SQL dataload to
execute in parallel. Analyzes SQL statements to create a dependency
graph using networkx, and batches statements by independent subgraphs
so dependent statements are always executed sequentially, and
independent statements may be executed concurrently.

May help significantly, but Hive compaction performance is a large
wildcard on functional-query dataload time. I've seen runs where it
takes 3m44s or 8m48s, reflecting either significant improvement or
slight regression compared to the baseline of 8m1s. Needs
investigation.

Change-Id: I9586504f6cb91f873f7ed978fda3df32e759ba90
---
M bin/load-data.py
M infra/python/deps/py3-requirements.txt
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
4 files changed, 108 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/28/23628/5
--
To view, visit http://gerrit.cloudera.org:8080/23628
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9586504f6cb91f873f7ed978fda3df32e759ba90
Gerrit-Change-Number: 23628
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Smith <[email protected]>
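
For readers unfamiliar with the batching idea in the commit message, here is a rough sketch of grouping SQL statements by independent subgraphs. It is not the code from this patch: the names `tables_in` and `batch_statements` and the regular expression are hypothetical, the table detection is deliberately simplistic, and a small stdlib union-find stands in for the networkx graph so the snippet runs without extra dependencies. With networkx, the same grouping could likely be obtained from the connected components of the dependency graph.

```python
import re
from collections import defaultdict

# Hypothetical, simplified pattern: picks up table names that follow a few
# common SQL keywords. Real schema SQL would need more careful parsing.
TABLE_RE = re.compile(
    r"\b(?:INTO\s+TABLE|TABLE|INTO|FROM|JOIN)\s+"
    r"(?:IF\s+(?:NOT\s+)?EXISTS\s+)?([\w.]+)",
    re.IGNORECASE,
)


def tables_in(stmt):
    """Return the lower-cased table names referenced by one statement."""
    return {name.lower() for name in TABLE_RE.findall(stmt)}


def batch_statements(statements, num_batches):
    """Split statements into num_batches lists.

    Statements that share a table end up in the same list, in their
    original relative order, so they run sequentially. Unrelated groups
    may land in different lists and can run concurrently.
    """
    parent = list(range(len(statements)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Link every statement to the first statement that mentioned the table.
    first_seen = {}
    for idx, stmt in enumerate(statements):
        for table in tables_in(stmt):
            if table in first_seen:
                union(first_seen[table], idx)
            else:
                first_seen[table] = idx

    # Collect each independent group, preserving statement order.
    groups = defaultdict(list)
    for idx, stmt in enumerate(statements):
        groups[find(idx)].append(stmt)

    # Greedy balancing: largest group first, into the shortest batch so far.
    batches = [[] for _ in range(num_batches)]
    for group in sorted(groups.values(), key=len, reverse=True):
        min(batches, key=len).extend(group)
    return batches
```

Each returned list would then be written to its own file and handed to one worker. The sketch balances by statement count only, which ignores how long individual statements take to run.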
