[
https://issues.apache.org/jira/browse/IMPALA-14553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18038920#comment-18038920
]
ASF subversion and git services commented on IMPALA-14553:
----------------------------------------------------------
Commit 166b39547e033956e3f5c941cb36165c59a18275 in impala's branch
refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=166b39547 ]
IMPALA-14553: Run schema eval concurrently
The majority of time spent in generate-schema-statements.py is in
eval_section for schema operations that shell out, often uploading files
via the hadoop CLI or generating data files. These operations should be
independent.
Runs eval_section once at the beginning, so it is not repeated for each
row in test_vectors, and executes the evaluations in parallel via a
ThreadPool. Defaults to NUM_CONCURRENT_TESTS threads because the
underlying operations already have some concurrency of their own (such
as HDFS mirroring writes).
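As a rough illustration of the approach described above (not the code from the commit), the sketch below evaluates each section once up front and runs the evaluations through a ThreadPool. The eval_section body, the eval_sections_concurrently helper, the section names, and the default of 4 threads are all hypothetical stand-ins; the real script defaults to NUM_CONCURRENT_TESTS threads, and how it obtains that value is not shown here.

```python
import subprocess
from multiprocessing.pool import ThreadPool


def eval_section(text):
    """Hypothetical stand-in for the script's eval_section.

    Assumption: a section starting with a backtick is a shell command whose
    output replaces the section; anything else is returned unchanged.
    """
    if not text.startswith('`'):
        return text
    output = subprocess.check_output(text[1:], shell=True,
                                     universal_newlines=True)
    return output.strip()


def eval_sections_concurrently(sections, num_threads=4):
    """Evaluate every section exactly once, in parallel.

    `sections` maps a section name to its raw text. Threads are sufficient
    here because the work is dominated by waiting on child processes.
    """
    names = list(sections)
    pool = ThreadPool(processes=num_threads)
    try:
        # map() preserves input order, so results line up with names.
        results = pool.map(eval_section, [sections[n] for n in names])
    finally:
        pool.close()
        pool.join()
    return dict(zip(names, results))


if __name__ == "__main__":
    # Hypothetical section names and contents, for demonstration only.
    raw = {"LOAD": "`echo hello", "CREATE": "CREATE TABLE t (i INT)"}
    evaluated = eval_sections_concurrently(raw)
    # Later loops over test_vectors can reuse `evaluated` without
    # shelling out again.
    print(evaluated)
```

Evaluating up front means any per-row loop only does dictionary lookups, so the cost of the shell commands is paid once regardless of how many rows test_vectors contains.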
Also collects existing tables into a set to optimize lookup.
Reduces generate-schema-statements runtime by ~60%, from 2m30s to 1m.
Confirmed that the contents of logs/data_loading/sql/functional are
identical before and after the change.
Change-Id: I2a78d05fd6a0005c83561978713237da2dde6af2
Reviewed-on: http://gerrit.cloudera.org:8080/23627
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Michael Smith <[email protected]>
> Speed up generate-schema-statements
> -----------------------------------
>
> Key: IMPALA-14553
> URL: https://issues.apache.org/jira/browse/IMPALA-14553
> Project: IMPALA
> Issue Type: Task
> Components: Infrastructure
> Reporter: Michael Smith
> Assignee: Michael Smith
> Priority: Minor
>
> Just generating the schemas for functional-query with
> {code}
> ./testdata/bin/generate-schema-statements.py --workload=functional-query
> {code}
> can take over 2 minutes. Most of that time is spent handling eval statements
> in functional_schema_template.sql (the shell commands that start with {{`}}).
> {{eval_section}} is called for each row in test_vectors, even though the
> operations don't change.
> We can speed this up by running {{eval_section}} once for each field, and
> running those evaluations in parallel.