Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/23588 )
Change subject: IMPALA-9715: Add devdata dataload option ...................................................................... Patch Set 11: Code-Review+1 (7 comments) Looks very useful! I mainly added high level comments, lot of them are probably out of scope for the change. http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG@13 PS11, Line 13: -devdata This is against how buildall.sh and load-data.py work now, but wouldn't it be simpler to have an env var for this, e.g. IMPALA_DATALOAD_TYPE=dev? For example if you only want to run some specific tests in a Jenkins job and know that they only need dev dataload, then injecting this env var could be used to speed up the build, while forcing the call to buildall.sh to be parameterized differently is more complicated. The env var would also make it possible to have skipIfs for this. http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG@13 PS11, Line 13: Also skips loading tables that : require Hive as they're less commonly used in testing. This is decided by having DEPENDENT_LOAD_HIVE section, right? This could be mentioned here. The biggest gap I see and what worth mentioning is that this removes most tables with complex types. http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG@19 PS11, Line 19: Loading TPC-DS data OK (Took: 1 min 58 sec) Is tpc-ds actually important for development? Or this is kept to keep more tests runnable? I practically only use it to have a table with many partitions + files (store_sales). For benchmarking a bigger scale factor is needed. http://gerrit.cloudera.org:8080/#/c/23588/11//COMMIT_MSG@31 PS11, Line 31: Reduces dataload from ~22 minutes to ~4 minutes with 16 cores. A few more things that could be interesting here: - the size of the dataload, which can be a limiting factor if someone doesn't have a large hard drive drive - I keep usually 1-2 dataloads and share them between Impala checkouts to avoid using too much space, and this may be avoidable if the size is drastically smaller with devdata as most checkouts probably only need that. - the memory on the machine and how much was used would be also nice to know, as again, this is a barrier for people with weaker machines to develep Impala - e.g. it would be great if it would turn out that 16GB is now enough for the dataload http://gerrit.cloudera.org:8080/#/c/23588/11/buildall.sh File buildall.sh: http://gerrit.cloudera.org:8080/#/c/23588/11/buildall.sh@265 PS11, Line 265: echo "[-testdata] : Loads test data. Implied as true if -snapshot_file is"\ : "specified. If -snapshot_file is not specified, data will be regenerated." Can you add dev data here? It also worth explaining how it interacts with testdata http://gerrit.cloudera.org:8080/#/c/23588/11/buildall.sh@715 PS11, Line 715: if [[ $NEED_MINICLUSTER -eq 1 ]]; then : start_test_cluster_dependencies : fi Probably it should be a different config, but if now Hive / Kudu / HBase is needed then we could skip starting these components. The biggest benefit I see is the lowered memory need by not running Kudu. http://gerrit.cloudera.org:8080/#/c/23588/11/testdata/bin/create-load-data.sh File testdata/bin/create-load-data.sh: http://gerrit.cloudera.org:8080/#/c/23588/11/testdata/bin/create-load-data.sh@610 PS11, Line 610: warm-up-hive Does dev dataload actually no longer need Hive? If yes, then this could be skipped. -- To view, visit http://gerrit.cloudera.org:8080/23588 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia942c2df5f238c570a2fc5e547786ac62b9f3af4 Gerrit-Change-Number: 23588 Gerrit-PatchSet: 11 Gerrit-Owner: Michael Smith <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Jason Fehr <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Pranav Lodha <[email protected]> Gerrit-Comment-Date: Mon, 17 Nov 2025 15:36:50 +0000 Gerrit-HasComments: Yes
