Hi, Mingyu Chen!

Thank you for your advice! It did actually help: I now get about 95% of the
direct HDFS throughput when ingesting from Hive.


Kind regards,
Nikita

Fri, Mar 10, 2023 at 18:37, Mingyu Chen <morning...@163.com>:

> Hi Nikita,
> First of all, you don't have to set all those configs and session
> variables.
> Setting them doesn't do anything for this problem.
>
>
> Let me make it clear, your problem is that:
> 1. Using `insert into table from hdfs()` can utilize most of your cluster's
> resources, and the speed is acceptable.
> 2. Using `insert into table from hive_external_table()` is much slower
> because it can't utilize the cluster's resources.
>
>
> Am I right?
> If yes, here are some tips:
> 1. Do not use the hive external table feature; as you can see, it has been
> deprecated since 1.2 [1].
> 2. Use a hive catalog [2] to connect to your hive cluster, and then use
> `insert into table from "table in hive catalog"` to do the ingest (see the
> sketch below).
> 3. Reset all those configs and session variables to their default values. The
> only config you may need is `set global parallel_fragment_exec_instance_num
> = 16`.
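>
> For example, a minimal sketch of steps 2 and 3 (the catalog name `hive_cat`
> is just a placeholder, and the metastore URI is taken from your table DDL
> below):
>
> CREATE CATALOG hive_cat PROPERTIES (
>     'type' = 'hms',
>     'hive.metastore.uris' = 'thrift://xyz.com:9085'
> );
>
> SET GLOBAL parallel_fragment_exec_instance_num = 16;
>
> -- Ingest through the catalog instead of a HIVE-engine external table.
> INSERT INTO tpcds.store_returns
>     SELECT * -- or the explicit column list, if the column order differs
>     FROM hive_cat.tpcds_1000.store_returns;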
>
>
>
>
> [1]  https://doris.apache.org/docs/dev/lakehouse/external-table/hive/
> [2] https://doris.apache.org/docs/dev/lakehouse/multi-catalog/hive
>
>
>
> --
>
> Best Regards
> 陈明雨 Mingyu Chen
>
> Email:
> morning...@apache.org
>
>
>
>
>
> On 2023-03-10 23:05:02, "Nikita Sakharin" <nikitasa1...@gmail.com> wrote:
> >Hi, Yongqiang YANG, thank you for your response!
> >
> >I appended the following lines to fe.conf:
> >async_load_task_pool_size = 320
> >async_loading_load_task_pool_size = 320
> >async_pending_load_task_pool_size = 320
> >
> >320 is the total number of cores in my configuration.
> >
> >I restarted all the machines in the cluster (including the FE) and tried
> >again. Unfortunately, it didn't help: the total number of instances for
> >Hive querying/importing is still 5, as before.
> >
> >Do you have any advice?
> >
> >Kind regards,
> >Nikita
> >
> >Fri, Mar 10, 2023 at 17:32, Yongqiang YANG <dataroar...@gmail.com>:
> >
> >> You can adjust async_pending_load_task_pool_size = 250
> >> and async_loading_load_task_pool_size = 250 in fe.conf and give it a try.
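> >>
> >> For example (a sketch; note that fe.conf is only re-read when the FE
> >> restarts, so a restart is needed for these to take effect):
> >>
> >> # fe.conf
> >> async_pending_load_task_pool_size = 250
> >> async_loading_load_task_pool_size = 250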
> >>
> >> On Fri, Mar 10, 2023 at 8:49 PM Nikita Sakharin <nikitasa1...@gmail.com>
> >> wrote:
> >>
> >> > Hi, Apache Doris Community!
> >> >
> >> > I'm a beginner user of Doris and I'm experiencing performance issues.
> >> >
> >> > I have *4 hosts for BE: Intel Xeon 80 cores, 768GB RAM each.* The
> >> > frontend and broker were deployed on a different machine with the same
> >> > configuration. All backends have "*mix*" as their *Node Role*.
> >> > The Doris version is doris-1.2.2-rc01.
> >> > The Apache Hive cluster has an HDFS throughput of about 1 GB/sec. The
> >> > replication factor is 3.
> >> > The TPC-DS benchmark <https://github.com/databricks/spark-sql-perf> was
> >> > used: 330 GB of parquet data with snappy compression. The total size
> >> > (with replication) is 1 TB.
> >> >
> >> > The default configuration of Apache Doris was used, except for these
> >> > parameters:
> >> > ADMIN SET FRONTEND CONFIG('broker_load_default_timeout_second' = '21600');
> >> > ADMIN SET FRONTEND CONFIG('default_db_max_running_txn_num' = '320');
> >> > ADMIN SET FRONTEND CONFIG('default_load_parallelism' = '80');
> >> > ADMIN SET FRONTEND CONFIG('default_max_query_instances' = '640');
> >> > ADMIN SET FRONTEND CONFIG('disable_balance' = 'false');
> >> > ADMIN SET FRONTEND CONFIG('disable_disk_balance' = 'false');
> >> > ADMIN SET FRONTEND CONFIG('max_balancing_tablets' = '320');
> >> > ADMIN SET FRONTEND CONFIG('max_broker_concurrency' = '80');
> >> > ADMIN SET FRONTEND CONFIG('max_bytes_per_broker_scanner' = '68719476736');
> >> > ADMIN SET FRONTEND CONFIG('max_routine_load_job_num' = '320');
> >> > ADMIN SET FRONTEND CONFIG('max_routine_load_task_num_per_be' = '80');
> >> > ADMIN SET FRONTEND CONFIG('max_routine_load_task_concurrent_num' = '320');
> >> > ADMIN SET FRONTEND CONFIG('max_running_txn_num_per_db' = '320');
> >> > ADMIN SET FRONTEND CONFIG('max_scheduling_tablets' = '20480');
> >> > ADMIN SET FRONTEND CONFIG('min_bytes_per_broker_scanner' = '1073741824');
> >> > ADMIN SET FRONTEND CONFIG('remote_fragment_exec_timeout_ms' = '60000');
> >> > ADMIN SET FRONTEND CONFIG('tablet_create_timeout_second' = '60');
> >> > ADMIN SET FRONTEND CONFIG('tablet_delete_timeout_second' = '60');
> >> >
> >> > SET GLOBAL cpu_resource_limit = 80;
> >> > SET GLOBAL enable_exchange_node_parallel_merge = true;
> >> > SET GLOBAL enable_profile = true;
> >> > SET GLOBAL enable_vectorized_engine = true;
> >> > SET GLOBAL exec_mem_limit = 768G;
> >> > SET GLOBAL parallel_exchange_instance_num = 80;
> >> > SET GLOBAL parallel_fragment_exec_instance_num = 80;
> >> > SET GLOBAL query_timeout = 86400;
> >> > SET GLOBAL send_batch_parallelism = 80;
> >> >
> >> > SET PROPERTY 'max_query_instances' = '640';
> >> > SET PROPERTY 'max_user_connections' = '1024';
> >> >
> >> > External tables were created in Apache Doris using commands:
> >> > CREATE EXTERNAL TABLE store_returns(
> >> >     sr_return_time_sk INT,
> >> >     sr_item_sk INT,
> >> >     ...
> >> >     sr_net_loss DECIMAL(7,2)
> >> > ) ENGINE=HIVE
> >> > COMMENT 'HIVE'
> >> > PROPERTIES(
> >> >     'hive.metastore.uris' = 'thrift://xyz.com:9085',
> >> >     'database' = 'tpcds_1000',
> >> >     'table' = 'store_returns'
> >> > );
> >> > The table in Apache Hive is partitioned by the sr_returned_date_sk field.
> >> >
> >> > Importing directly from HDFS utilizes *259 instances on four BEs*:
> >> > 66 on the first BE,
> >> > 65 on the second BE,
> >> > 64 each on the third and fourth BEs.
> >> > Query:
> >> > INSERT INTO tpcds.store_returns
> >> >     SELECT NULL, *
> >> >     FROM HDFS(
> >> >         'uri' = 'hdfs://lang33.ca.sbrf.ru:8020/tmp/tpcds_1000/store_returns/sr_returned_date_sk=__HIVE_DEFAULT_PARTITION__/*',
> >> >         'fs.defaultFS' = 'hdfs://lang33.ca.sbrf.ru:8020',
> >> >         'hadoop.username' = 'hdfs',
> >> >         'format' = 'parquet'
> >> >     );
> >> >
> >> > Importing/querying Apache Hive utilizes *only 5 instances on four BEs*:
> >> > 2 on the first BE,
> >> > 1 each on the second, third, and fourth BEs.
> >> > Query:
> >> > INSERT INTO tpcds.store_returns
> >> >     SELECT
> >> >         sr_returned_date_sk,
> >> >         sr_customer_sk,
> >> >         sr_store_sk,
> >> >         sr_return_time_sk,
> >> >         sr_item_sk,
> >> >         sr_cdemo_sk,
> >> >         sr_hdemo_sk,
> >> >         sr_addr_sk,
> >> >         sr_reason_sk,
> >> >         sr_ticket_number,
> >> >         sr_return_quantity,
> >> >         sr_return_amt,
> >> >         sr_return_tax,
> >> >         sr_return_amt_inc_tax,
> >> >         sr_fee,
> >> >         sr_return_ship_cost,
> >> >         sr_refunded_cash,
> >> >         sr_reversed_charge,
> >> >         sr_store_credit,
> >> >         sr_net_loss
> >> >     FROM tpcds_10000.store_returns;
> >> >
> >> > How can I make Apache Doris utilize more instances for importing/querying
> >> > external Hive tables? I would appreciate your advice on parameter/setting
> >> > values.
> >> >
> >> > Kind regards,
> >> > Nikita
> >> >
> >>
>
