Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23588 )

Change subject: IMPALA-9715: Add devdata dataload option
......................................................................


Patch Set 11: Code-Review+1

(7 comments)

Looks very useful!
I mainly added high level comments, lot of them are probably out of scope for 
the change.

http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG@13
PS11, Line 13: -devdata
This is against how buildall.sh and load-data.py work now, but wouldn't it be 
simpler to have an env var for this, e.g. IMPALA_DATALOAD_TYPE=dev? For example 
if you only want to run some specific tests in a Jenkins job and know that they 
only need dev dataload, then injecting this env var could be used to speed up 
the build, while forcing the call to buildall.sh to be parameterized 
differently is more complicated. The env var would also make it possible to 
have skipIfs for this.


http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG@13
PS11, Line 13:  Also skips loading tables that
             : require Hive as they're less commonly used in testing.
This is decided by having DEPENDENT_LOAD_HIVE section, right? This could be 
mentioned here.

The biggest gap I see and what worth mentioning is that this removes most 
tables with complex types.


http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG@19
PS11, Line 19:     Loading TPC-DS data OK (Took: 1 min 58 sec)
Is tpc-ds actually important for development? Or this is kept to keep more 
tests runnable? I practically only use it to have a table with many partitions 
+ files (store_sales). For benchmarking a bigger scale factor is needed.


http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG@31
PS11, Line 31: Reduces dataload from ~22 minutes to ~4 minutes with 16 cores.
A few more things that could be interesting here:
- the size of the dataload, which can be a limiting factor if someone doesn't 
have a large hard drive drive - I keep usually 1-2 dataloads and share them 
between Impala checkouts to avoid using too much space, and this may be 
avoidable if the size is drastically smaller with devdata as most checkouts 
probably only need that.
- the memory on the machine and how much was used would be also nice to know, 
as again, this is a barrier for people with weaker machines to develep Impala - 
e.g. it would be great if it would turn out that 16GB is now enough for the 
dataload


http://gerrit.cloudera.org:8080/#/c/23588/11/buildall.sh
File buildall.sh:

http://gerrit.cloudera.org:8080/#/c/23588/11/buildall.sh@265
PS11, Line 265:       echo "[-testdata] : Loads test data. Implied as true if 
-snapshot_file is"\
              :            "specified. If -snapshot_file is not specified, data 
will be regenerated."
Can you add dev data here? It also worth explaining how it interacts with 
testdata


http://gerrit.cloudera.org:8080/#/c/23588/11/buildall.sh@715
PS11, Line 715: if [[ $NEED_MINICLUSTER -eq 1 ]]; then
              :   start_test_cluster_dependencies
              : fi
Probably it should be a different config, but if now Hive / Kudu / HBase is 
needed then we could skip starting these components. The biggest benefit I see 
is the lowered memory need by not running Kudu.


http://gerrit.cloudera.org:8080/#/c/23588/11/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/23588/11/testdata/bin/create-load-data.sh@610
PS11, Line 610:   warm-up-hive
Does dev dataload actually no longer need Hive? If yes, then this could be 
skipped.



--
To view, visit http://gerrit.cloudera.org:8080/23588
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia942c2df5f238c570a2fc5e547786ac62b9f3af4
Gerrit-Change-Number: 23588
Gerrit-PatchSet: 11
Gerrit-Owner: Michael Smith <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Pranav Lodha <[email protected]>
Gerrit-Comment-Date: Mon, 17 Nov 2025 15:36:50 +0000
Gerrit-HasComments: Yes

Reply via email to