Hi, Apache Doris Community! I'm a beginner user of Doris and I'm experiencing performance issues.
I have *4 hosts for BE: Intel Xeon 80 cores, 768GB RAM each.* Frontend and broker were deployed on a different machine with the same configuration. All backends have "*mix*" as *Node Role*. Doris version is doris-1.2.2-rc01. Apache Hive cluster with HDFS throughput about 1 GB/sec. Replication factor is 3 TPCDS benchmark <https://github.com/databricks/spark-sql-perf> was used: 330 GB of parquet data, snappy compression. Total size (with replication) is 1TB. The default configuration of Apache Doris was used except these parameters: ADMIN SET FRONTEND CONFIG('broker_load_default_timeout_second' = '21600'); ADMIN SET FRONTEND CONFIG('default_db_max_running_txn_num' = '320'); ADMIN SET FRONTEND CONFIG('default_load_parallelism' = '80'); ADMIN SET FRONTEND CONFIG('default_max_query_instances' = '640'); ADMIN SET FRONTEND CONFIG('disable_balance' = 'false'); ADMIN SET FRONTEND CONFIG('disable_disk_balance' = 'false'); ADMIN SET FRONTEND CONFIG('max_balancing_tablets' = '320'); ADMIN SET FRONTEND CONFIG('max_broker_concurrency' = '80'); ADMIN SET FRONTEND CONFIG('max_bytes_per_broker_scanner' = '68719476736'); ADMIN SET FRONTEND CONFIG('max_routine_load_job_num' = '320'); ADMIN SET FRONTEND CONFIG('max_routine_load_task_num_per_be' = '80'); ADMIN SET FRONTEND CONFIG('max_routine_load_task_concurrent_num' = '320'); ADMIN SET FRONTEND CONFIG('max_running_txn_num_per_db' = '320'); ADMIN SET FRONTEND CONFIG('max_scheduling_tablets' = '20480'); ADMIN SET FRONTEND CONFIG('min_bytes_per_broker_scanner' = '1073741824'); ADMIN SET FRONTEND CONFIG('remote_fragment_exec_timeout_ms' = '60000'); ADMIN SET FRONTEND CONFIG('tablet_create_timeout_second' = '60'); ADMIN SET FRONTEND CONFIG('tablet_delete_timeout_second' = '60'); SET GLOBAL cpu_resource_limit = 80; SET GLOBAL enable_exchange_node_parallel_merge = true; SET GLOBAL enable_profile = true; SET GLOBAL enable_vectorized_engine = true; SET GLOBAL exec_mem_limit = 768G; SET GLOBAL parallel_exchange_instance_num = 80; SET GLOBAL parallel_fragment_exec_instance_num = 80; SET GLOBAL query_timeout = 86400; SET GLOBAL send_batch_parallelism = 80; SET PROPERTY 'max_query_instances' = '640'; SET PROPERTY 'max_user_connections' = '1024'; External tables were created in Apache Doris using commands: CREATE EXTERNAL TABLE store_returns( sr_return_time_sk INT, sr_item_sk INT, ... sr_net_loss DECIMAL(7,2) ) ENGINE=HIVE COMMENT 'HIVE' PROPERTIES( 'hive.metastore.uris' = 'thrift://xyz.com:9085', 'database' = 'tpcds_1000', 'table' = 'store_returns' ); The table in Apache Hive is partitioned by sr_return_date_sk field. Importing directly from HDFS utilizes *259 instances on four B**Es*: 66 on first BE, 65 on second BE, 64 and 64 on third and fourth BEs. Query: INSERT INTO tpcds.store_returns SELECT NULL, * FROM HDFS( 'uri' = 'hdfs:// lang33.ca.sbrf.ru:8020/tmp/tpcds_1000/store_returns/sr_returned_date_sk=__HIVE_DEFAULT_PARTITION__/* ', 'fs.defaultFS' = 'hdfs://lang33.ca.sbrf.ru:8020', 'hadoop.username' = 'hdfs', 'format' = 'parquet' ); Importing/querying Apache Hive utilizes *only 5 instances on four BEs*: 2 on first BE, 1 on second, third and fourth BEs. Query: INSERT INTO tpcds.store_returns SELECT sr_returned_date_sk, sr_customer_sk, sr_store_sk, sr_return_time_sk, sr_item_sk, sr_cdemo_sk, sr_hdemo_sk, sr_addr_sk, sr_reason_sk, sr_ticket_number, sr_return_quantity, sr_return_amt, sr_return_tax, sr_return_amt_inc_tax, sr_fee, sr_return_ship_cost, sr_refunded_cash, sr_reversed_charge, sr_store_credit, sr_net_loss FROM tpcds_10000.store_returns; How to make Apache Doris utilize more instances for importing/querying external Hive tables? Kindly ask for your advice about parameters/settings values. Kind regards, Nikita