[jira] [Commented] (IMPALA-14553) Speed up generate-schema-statements

ASF subversion and git services (Jira) Mon, 17 Nov 2025 08:35:06 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-14553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18038920#comment-18038920
 ]


ASF subversion and git services commented on IMPALA-14553:
----------------------------------------------------------

Commit 166b39547e033956e3f5c941cb36165c59a18275 in impala's branch 
refs/heads/master from Michael Smith
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=166b39547 ]

IMPALA-14553: Run schema eval concurrently

The majority of time spent in generate-schema-statements.py is in
eval_section for schema operations that shell out, often uploading files
via the hadoop CLI or generating data files. These operations should be
independent.

Runs eval_section once at the beginning, rather than repeating it for
each row in test_vectors, and executes the evaluations in parallel via a
ThreadPool. Defaults to NUM_CONCURRENT_TESTS threads because the
underlying operations have some concurrency of their own (such as HDFS
mirroring writes).
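
A minimal sketch of the idea, not the code from the commit: evaluate each
distinct section once, up front, on a thread pool, then look the results
up per row. The eval_section body here is a hypothetical stand-in, and
reading NUM_CONCURRENT_TESTS from the environment with a fallback of 4 is
an assumption made for illustration.

```python
import os
import subprocess
from multiprocessing.pool import ThreadPool


def eval_section(section):
    """Hypothetical stand-in for the script's eval_section.

    A section starting with a backtick is run as a shell command and its
    stdout is returned; any other section is returned unchanged.
    """
    if not section.startswith("`"):
        return section
    return subprocess.check_output(
        section[1:], shell=True, universal_newlines=True)


def eval_sections_once(sections, num_threads):
    """Evaluate each distinct section once, in parallel.

    Returns a dict mapping the original section text to its result, so
    later per-row processing is a lookup instead of a new shell-out.
    """
    unique = sorted(set(sections))
    pool = ThreadPool(processes=num_threads)
    try:
        results = pool.map(eval_section, unique)
    finally:
        pool.close()
        pool.join()
    return dict(zip(unique, results))


# Assumption: thread count taken from the environment, defaulting to 4.
num_threads = int(os.environ.get("NUM_CONCURRENT_TESTS", "4"))

# Illustrative input: the same shell section appears in more than one row.
sections = ["`echo loaded", "CREATE TABLE t (id INT)", "`echo loaded"]
evaluated = eval_sections_once(sections, num_threads)
```

Threads rather than processes are enough here because the work is done
by child processes; the Python side mostly waits on them.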

Also collects existing tables into a set to optimize lookup.
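
The lookup change can be sketched as follows. The function and table
names are hypothetical, chosen only to illustrate replacing repeated list
scans with set membership checks.

```python
def tables_to_create(requested, existing):
    """Return the requested tables that do not exist yet, in order."""
    # Build the set once; each membership check is then O(1) on average
    # instead of a linear scan over the list of existing tables.
    existing_set = set(existing)
    return [name for name in requested if name not in existing_set]


# Illustrative table names only.
existing = ["alltypes", "alltypessmall", "nulltable"]
requested = ["alltypes", "decimal_tbl", "nulltable", "widetable"]
missing = tables_to_create(requested, existing)
```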

Reduces generate-schema-statements runtime by ~60%, from 2m30s to 1m.
Confirmed that the contents of logs/data_loading/sql/functional are
identical.

Change-Id: I2a78d05fd6a0005c83561978713237da2dde6af2
Reviewed-on: http://gerrit.cloudera.org:8080/23627
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Michael Smith <[email protected]>


> Speed up generate-schema-statements
> -----------------------------------
>
>                 Key: IMPALA-14553
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14553
>             Project: IMPALA
>          Issue Type: Task
>          Components: Infrastructure
>            Reporter: Michael Smith
>            Assignee: Michael Smith
>            Priority: Minor
>
> Just generating the schemas for functional-query with
> {code}
> ./testdata/bin/generate-schema-statements.py --workload=functional-query
> {code}
> can take over 2 minutes. Most of that time is spent handling eval statements 
> in functional_schema_template.sql (the shell commands that start with {{`}}). 
> {{eval_section}} is called for each row in test_vectors, even though the 
> operations don't change.
> We can speed this up by running {{eval_section}} once for each field, and 
> running them in parallel.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

