Hi, Mingyu Chen! Thank you for your advice! It did actually help: I now get about 95% of HDFS throughput when loading from Hive.
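In case it helps anyone who finds this thread, the working load now looks roughly like the sketch below. The catalog name hive_tpcds is just my own choice, the metastore URI is the same placeholder as in my original mail, and the column list is abbreviated (it is the same one as in the full query quoted below):

-- Create a Hive catalog once; this replaces the deprecated
-- per-table ENGINE=HIVE external table definitions:
CREATE CATALOG hive_tpcds PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://xyz.com:9085'
);

-- With all other configs and session variables back at their defaults,
-- only this one session variable is raised, as you suggested:
SET GLOBAL parallel_fragment_exec_instance_num = 16;

-- The load then goes straight from the catalog table into the internal table:
INSERT INTO tpcds.store_returns
SELECT sr_returned_date_sk, sr_customer_sk, ..., sr_net_loss
FROM hive_tpcds.tpcds_1000.store_returns;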
Kind regards,
Nikita

Fri, 10 Mar 2023 at 18:37, Mingyu Chen <morning...@163.com>:

> Hi Nikita,
> First of all, you don't have to set all those configs and session
> variables. Setting them doesn't do anything for this problem.
>
> Let me make it clear, your problem is that:
> 1. Using `insert into table from hdfs()` can utilize most of your cluster's
> resources, and the speed is acceptable.
> 2. Using `insert into table from hive_external_table()` is much slower
> because it can't utilize the cluster's resources.
>
> Am I right?
> If yes, here are some tips:
> 1. Do not use the hive external table feature; as you can see, this feature
> has been deprecated since 1.2[1].
> 2. Use a hive catalog[2] to connect to your hive cluster, and then use
> `insert into table from "table in hive catalog"` to do the ingest.
> 3. Reset all those configs and session variables to their default values.
> The only config you may need is
> `set global parallel_fragment_exec_instance_num = 16`.
>
> [1] https://doris.apache.org/docs/dev/lakehouse/external-table/hive/
> [2] https://doris.apache.org/docs/dev/lakehouse/multi-catalog/hive
>
> --
>
> Best Regards
> 陈明雨 Mingyu Chen
>
> Email:
> morning...@apache.org
>
>
> On 2023-03-10 23:05:02, "Nikita Sakharin" <nikitasa1...@gmail.com> wrote:
> >Hi, Yongqiang YANG, thank you for your response!
> >
> >I appended fe.conf with these lines:
> >async_load_task_pool_size = 320
> >async_loading_load_task_pool_size = 320
> >async_pending_load_task_pool_size = 320
> >
> >320 is the total number of cores in my configuration.
> >
> >I restarted all the machines in the cluster (including the FE) and tried
> >again. Unfortunately, it didn't help. The total number of instances for
> >Hive querying/importing is 5, as before.
> >
> >Do you have any advice?
> >
> >Kind regards,
> >Nikita
> >
> >Fri, 10 Mar 2023 at 17:32, Yongqiang YANG <dataroar...@gmail.com>:
> >
> >> You can adjust async_pending_load_task_pool_size = 250
> >> and async_loading_load_task_pool_size = 250 in fe.conf and have a try.
> >>
> >> On Fri, Mar 10, 2023 at 8:49 PM Nikita Sakharin <nikitasa1...@gmail.com>
> >> wrote:
> >>
> >> > Hi, Apache Doris Community!
> >> >
> >> > I'm a beginner user of Doris and I'm experiencing performance issues.
> >> >
> >> > I have *4 hosts for BE: Intel Xeon 80 cores, 768 GB RAM each.* The
> >> > frontend and broker were deployed on a different machine with the same
> >> > configuration. All backends have "*mix*" as their *Node Role*.
> >> > The Doris version is doris-1.2.2-rc01.
> >> > The Apache Hive cluster has an HDFS throughput of about 1 GB/sec. The
> >> > replication factor is 3.
> >> > The TPC-DS benchmark <https://github.com/databricks/spark-sql-perf>
> >> > was used: 330 GB of parquet data, snappy compression. Total size
> >> > (with replication) is 1 TB.
> >> >
> >> > The default configuration of Apache Doris was used, except for these
> >> > parameters:
> >> > ADMIN SET FRONTEND CONFIG('broker_load_default_timeout_second' = '21600');
> >> > ADMIN SET FRONTEND CONFIG('default_db_max_running_txn_num' = '320');
> >> > ADMIN SET FRONTEND CONFIG('default_load_parallelism' = '80');
> >> > ADMIN SET FRONTEND CONFIG('default_max_query_instances' = '640');
> >> > ADMIN SET FRONTEND CONFIG('disable_balance' = 'false');
> >> > ADMIN SET FRONTEND CONFIG('disable_disk_balance' = 'false');
> >> > ADMIN SET FRONTEND CONFIG('max_balancing_tablets' = '320');
> >> > ADMIN SET FRONTEND CONFIG('max_broker_concurrency' = '80');
> >> > ADMIN SET FRONTEND CONFIG('max_bytes_per_broker_scanner' = '68719476736');
> >> > ADMIN SET FRONTEND CONFIG('max_routine_load_job_num' = '320');
> >> > ADMIN SET FRONTEND CONFIG('max_routine_load_task_num_per_be' = '80');
> >> > ADMIN SET FRONTEND CONFIG('max_routine_load_task_concurrent_num' = '320');
> >> > ADMIN SET FRONTEND CONFIG('max_running_txn_num_per_db' = '320');
> >> > ADMIN SET FRONTEND CONFIG('max_scheduling_tablets' = '20480');
> >> > ADMIN SET FRONTEND CONFIG('min_bytes_per_broker_scanner' = '1073741824');
> >> > ADMIN SET FRONTEND CONFIG('remote_fragment_exec_timeout_ms' = '60000');
> >> > ADMIN SET FRONTEND CONFIG('tablet_create_timeout_second' = '60');
> >> > ADMIN SET FRONTEND CONFIG('tablet_delete_timeout_second' = '60');
> >> >
> >> > SET GLOBAL cpu_resource_limit = 80;
> >> > SET GLOBAL enable_exchange_node_parallel_merge = true;
> >> > SET GLOBAL enable_profile = true;
> >> > SET GLOBAL enable_vectorized_engine = true;
> >> > SET GLOBAL exec_mem_limit = 768G;
> >> > SET GLOBAL parallel_exchange_instance_num = 80;
> >> > SET GLOBAL parallel_fragment_exec_instance_num = 80;
> >> > SET GLOBAL query_timeout = 86400;
> >> > SET GLOBAL send_batch_parallelism = 80;
> >> >
> >> > SET PROPERTY 'max_query_instances' = '640';
> >> > SET PROPERTY 'max_user_connections' = '1024';
> >> >
> >> > External tables were created in Apache Doris using commands like:
> >> > CREATE EXTERNAL TABLE store_returns(
> >> >     sr_return_time_sk INT,
> >> >     sr_item_sk INT,
> >> >     ...
> >> >     sr_net_loss DECIMAL(7,2)
> >> > ) ENGINE=HIVE
> >> > COMMENT 'HIVE'
> >> > PROPERTIES(
> >> >     'hive.metastore.uris' = 'thrift://xyz.com:9085',
> >> >     'database' = 'tpcds_1000',
> >> >     'table' = 'store_returns'
> >> > );
> >> > The table in Apache Hive is partitioned by the sr_returned_date_sk field.
> >> >
> >> > Importing directly from HDFS utilizes *259 instances on four BEs*:
> >> > 66 on the first BE,
> >> > 65 on the second BE,
> >> > 64 and 64 on the third and fourth BEs.
> >> > Query:
> >> > INSERT INTO tpcds.store_returns
> >> > SELECT NULL, *
> >> > FROM HDFS(
> >> >     'uri' = 'hdfs://lang33.ca.sbrf.ru:8020/tmp/tpcds_1000/store_returns/sr_returned_date_sk=__HIVE_DEFAULT_PARTITION__/*',
> >> >     'fs.defaultFS' = 'hdfs://lang33.ca.sbrf.ru:8020',
> >> >     'hadoop.username' = 'hdfs',
> >> >     'format' = 'parquet'
> >> > );
> >> >
> >> > Importing from/querying Apache Hive utilizes *only 5 instances on four
> >> > BEs*:
> >> > 2 on the first BE,
> >> > 1 on each of the second, third and fourth BEs.
> >> > Query:
> >> > INSERT INTO tpcds.store_returns
> >> > SELECT
> >> >     sr_returned_date_sk,
> >> >     sr_customer_sk,
> >> >     sr_store_sk,
> >> >     sr_return_time_sk,
> >> >     sr_item_sk,
> >> >     sr_cdemo_sk,
> >> >     sr_hdemo_sk,
> >> >     sr_addr_sk,
> >> >     sr_reason_sk,
> >> >     sr_ticket_number,
> >> >     sr_return_quantity,
> >> >     sr_return_amt,
> >> >     sr_return_tax,
> >> >     sr_return_amt_inc_tax,
> >> >     sr_fee,
> >> >     sr_return_ship_cost,
> >> >     sr_refunded_cash,
> >> >     sr_reversed_charge,
> >> >     sr_store_credit,
> >> >     sr_net_loss
> >> > FROM tpcds_10000.store_returns;
> >> >
> >> > How can I make Apache Doris utilize more instances when importing from
> >> > or querying external Hive tables? I would kindly ask for your advice on
> >> > parameter/setting values.
> >> >
> >> > Kind regards,
> >> > Nikita
> >> >