Hi Nikita,
First of all, you don't need to set all those configs and session variables; 
they have no effect on this problem.


Let me make it clear. Your problem is that:
1. `insert into table from hdfs()` can utilize most of your cluster's 
resources, and the speed is acceptable.
2. `insert into table from hive_external_table()` is much slower because it 
can't utilize the cluster's resources.


Am I right?
If so, here are some tips:
1. Do not use the Hive external table feature; it has been deprecated 
since 1.2 [1].
2. Use the Hive catalog [2] to connect to your Hive cluster, and then use 
`insert into table from "table in hive catalog"` to do the ingestion (see 
the sketch after this list).
3. Reset all those configs and session variables to their default values. 
The only config you may need is 
`set global parallel_fragment_exec_instance_num = 16`.
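
A minimal sketch of tips 2 and 3, assuming the metastore URI from the 
CREATE EXTERNAL TABLE statement in your mail; the catalog name `hive` is 
arbitrary, and `internal.tpcds.store_returns` / `hive.tpcds_1000.store_returns` 
assume your target table lives in the default internal catalog and the 
source table in the tpcds_1000 Hive database:

-- create a catalog pointing at your Hive Metastore
CREATE CATALOG hive PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://xyz.com:9085'
);

-- ingest by reading from the Hive catalog and writing into the internal
-- catalog; adjust the select list to match your target schema, as in
-- your original query
INSERT INTO internal.tpcds.store_returns
SELECT * FROM hive.tpcds_1000.store_returns;

-- tip 3: check a session variable before/after resetting it
SHOW VARIABLES LIKE 'parallel_fragment_exec_instance_num';
SET GLOBAL parallel_fragment_exec_instance_num = 16;

With the catalog approach the scan goes through the multi-catalog path, 
which should be able to spread across your BEs like the HDFS() import does.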




[1] https://doris.apache.org/docs/dev/lakehouse/external-table/hive/
[2] https://doris.apache.org/docs/dev/lakehouse/multi-catalog/hive



--

Best Regards
陈明雨 Mingyu Chen

Email:
morning...@apache.org





On 2023-03-10 23:05:02, "Nikita Sakharin" <nikitasa1...@gmail.com> wrote:
>Hi, Yongqiang YANG, thank you for your response!
>
>I appended these lines to fe.conf:
>async_load_task_pool_size = 320
>async_loading_load_task_pool_size = 320
>async_pending_load_task_pool_size = 320
>
>320 is the total number of cores in my configuration.
>
>I restarted all the machines in the cluster (including the FE) and tried
>again. Unfortunately, it didn't help: the total number of instances for
>Hive querying/importing is still 5, as before.
>
>Do you have any advice?
>
>Kind regards,
>Nikita
>
>Fri, Mar 10, 2023 at 17:32, Yongqiang YANG <dataroar...@gmail.com>:
>
>> You can adjust async_pending_load_task_pool_size = 250
>> and async_loading_load_task_pool_size = 250 in fe.conf and give it a try.
>>
>> On Fri, Mar 10, 2023 at 8:49 PM Nikita Sakharin <nikitasa1...@gmail.com>
>> wrote:
>>
>> > Hi, Apache Doris Community!
>> >
>> > I'm a beginner user of Doris and I'm experiencing performance issues.
>> >
>> > I have *4 hosts for BE: Intel Xeon 80 cores, 768GB RAM each.* Frontend
>> > and broker were deployed on a different machine with the same
>> > configuration. All backends have "*mix*" as *Node Role*.
>> > Doris version is doris-1.2.2-rc01.
>> > Apache Hive cluster with HDFS throughput of about 1 GB/sec. The
>> > replication factor is 3.
>> > The TPCDS benchmark <https://github.com/databricks/spark-sql-perf> was
>> > used: 330 GB of parquet data, snappy compression. Total size (with
>> > replication) is 1 TB.
>> >
>> > The default configuration of Apache Doris was used, except for these
>> > parameters:
>> > ADMIN SET FRONTEND CONFIG('broker_load_default_timeout_second' = '21600');
>> > ADMIN SET FRONTEND CONFIG('default_db_max_running_txn_num' = '320');
>> > ADMIN SET FRONTEND CONFIG('default_load_parallelism' = '80');
>> > ADMIN SET FRONTEND CONFIG('default_max_query_instances' = '640');
>> > ADMIN SET FRONTEND CONFIG('disable_balance' = 'false');
>> > ADMIN SET FRONTEND CONFIG('disable_disk_balance' = 'false');
>> > ADMIN SET FRONTEND CONFIG('max_balancing_tablets' = '320');
>> > ADMIN SET FRONTEND CONFIG('max_broker_concurrency' = '80');
>> > ADMIN SET FRONTEND CONFIG('max_bytes_per_broker_scanner' = '68719476736');
>> > ADMIN SET FRONTEND CONFIG('max_routine_load_job_num' = '320');
>> > ADMIN SET FRONTEND CONFIG('max_routine_load_task_num_per_be' = '80');
>> > ADMIN SET FRONTEND CONFIG('max_routine_load_task_concurrent_num' = '320');
>> > ADMIN SET FRONTEND CONFIG('max_running_txn_num_per_db' = '320');
>> > ADMIN SET FRONTEND CONFIG('max_scheduling_tablets' = '20480');
>> > ADMIN SET FRONTEND CONFIG('min_bytes_per_broker_scanner' = '1073741824');
>> > ADMIN SET FRONTEND CONFIG('remote_fragment_exec_timeout_ms' = '60000');
>> > ADMIN SET FRONTEND CONFIG('tablet_create_timeout_second' = '60');
>> > ADMIN SET FRONTEND CONFIG('tablet_delete_timeout_second' = '60');
>> >
>> > SET GLOBAL cpu_resource_limit = 80;
>> > SET GLOBAL enable_exchange_node_parallel_merge = true;
>> > SET GLOBAL enable_profile = true;
>> > SET GLOBAL enable_vectorized_engine = true;
>> > SET GLOBAL exec_mem_limit = 768G;
>> > SET GLOBAL parallel_exchange_instance_num = 80;
>> > SET GLOBAL parallel_fragment_exec_instance_num = 80;
>> > SET GLOBAL query_timeout = 86400;
>> > SET GLOBAL send_batch_parallelism = 80;
>> >
>> > SET PROPERTY 'max_query_instances' = '640';
>> > SET PROPERTY 'max_user_connections' = '1024';
>> >
>> > External tables were created in Apache Doris using commands:
>> > CREATE EXTERNAL TABLE store_returns(
>> >     sr_return_time_sk INT,
>> >     sr_item_sk INT,
>> >     ...
>> >     sr_net_loss DECIMAL(7,2)
>> > ) ENGINE=HIVE
>> > COMMENT 'HIVE'
>> > PROPERTIES(
>> >     'hive.metastore.uris' = 'thrift://xyz.com:9085',
>> >     'database' = 'tpcds_1000',
>> >     'table' = 'store_returns'
>> > );
>> > The table in Apache Hive is partitioned by the sr_returned_date_sk field.
>> >
>> > Importing directly from HDFS utilizes *259 instances on four BEs*:
>> > 66 on the first BE,
>> > 65 on the second BE,
>> > 64 each on the third and fourth BEs.
>> > Query:
>> > INSERT INTO tpcds.store_returns
>> >     SELECT NULL, *
>> >     FROM HDFS(
>> >         'uri' = 'hdfs://lang33.ca.sbrf.ru:8020/tmp/tpcds_1000/store_returns/sr_returned_date_sk=__HIVE_DEFAULT_PARTITION__/*',
>> >         'fs.defaultFS' = 'hdfs://lang33.ca.sbrf.ru:8020',
>> >         'hadoop.username' = 'hdfs',
>> >         'format' = 'parquet'
>> >     );
>> >
>> > Importing/querying Apache Hive utilizes *only 5 instances on four BEs*:
>> > 2 on the first BE,
>> > 1 each on the second, third, and fourth BEs.
>> > Query:
>> > INSERT INTO tpcds.store_returns
>> >     SELECT
>> >         sr_returned_date_sk,
>> >         sr_customer_sk,
>> >         sr_store_sk,
>> >         sr_return_time_sk,
>> >         sr_item_sk,
>> >         sr_cdemo_sk,
>> >         sr_hdemo_sk,
>> >         sr_addr_sk,
>> >         sr_reason_sk,
>> >         sr_ticket_number,
>> >         sr_return_quantity,
>> >         sr_return_amt,
>> >         sr_return_tax,
>> >         sr_return_amt_inc_tax,
>> >         sr_fee,
>> >         sr_return_ship_cost,
>> >         sr_refunded_cash,
>> >         sr_reversed_charge,
>> >         sr_store_credit,
>> >         sr_net_loss
>> >     FROM tpcds_10000.store_returns;
>> >
>> > How can I make Apache Doris utilize more instances for importing/querying
>> > external Hive tables? I would kindly ask for your advice about
>> > parameter/setting values.
>> >
>> > Kind regards,
>> > Nikita
>> >
>>
