Hi, Apache Doris Community!

I'm a beginner user of Doris and I'm experiencing performance issues.

I have *4 hosts for BE: Intel Xeon 80 cores, 768GB RAM each.* Frontend and
broker were deployed on a different machine with the same configuration.
All backends have "*mix*" as *Node Role*.
Doris version is doris-1.2.2-rc01.
Apache Hive cluster with HDFS throughput about 1 GB/sec. Replication factor
is 3
TPCDS benchmark <https://github.com/databricks/spark-sql-perf> was used:
330 GB of parquet data, snappy compression. Total size (with replication)
is 1TB.

The default configuration of Apache Doris was used except these parameters:
ADMIN SET FRONTEND CONFIG('broker_load_default_timeout_second' = '21600');
ADMIN SET FRONTEND CONFIG('default_db_max_running_txn_num' = '320');
ADMIN SET FRONTEND CONFIG('default_load_parallelism' = '80');
ADMIN SET FRONTEND CONFIG('default_max_query_instances' = '640');
ADMIN SET FRONTEND CONFIG('disable_balance' = 'false');
ADMIN SET FRONTEND CONFIG('disable_disk_balance' = 'false');
ADMIN SET FRONTEND CONFIG('max_balancing_tablets' = '320');
ADMIN SET FRONTEND CONFIG('max_broker_concurrency' = '80');
ADMIN SET FRONTEND CONFIG('max_bytes_per_broker_scanner' = '68719476736');
ADMIN SET FRONTEND CONFIG('max_routine_load_job_num' = '320');
ADMIN SET FRONTEND CONFIG('max_routine_load_task_num_per_be' = '80');
ADMIN SET FRONTEND CONFIG('max_routine_load_task_concurrent_num' = '320');
ADMIN SET FRONTEND CONFIG('max_running_txn_num_per_db' = '320');
ADMIN SET FRONTEND CONFIG('max_scheduling_tablets' = '20480');
ADMIN SET FRONTEND CONFIG('min_bytes_per_broker_scanner' = '1073741824');
ADMIN SET FRONTEND CONFIG('remote_fragment_exec_timeout_ms' = '60000');
ADMIN SET FRONTEND CONFIG('tablet_create_timeout_second' = '60');
ADMIN SET FRONTEND CONFIG('tablet_delete_timeout_second' = '60');

SET GLOBAL cpu_resource_limit = 80;
SET GLOBAL enable_exchange_node_parallel_merge = true;
SET GLOBAL enable_profile = true;
SET GLOBAL enable_vectorized_engine = true;
SET GLOBAL exec_mem_limit = 768G;
SET GLOBAL parallel_exchange_instance_num = 80;
SET GLOBAL parallel_fragment_exec_instance_num = 80;
SET GLOBAL query_timeout = 86400;
SET GLOBAL send_batch_parallelism = 80;

SET PROPERTY 'max_query_instances' = '640';
SET PROPERTY 'max_user_connections' = '1024';

External tables were created in Apache Doris using commands:
CREATE EXTERNAL TABLE store_returns(
    sr_return_time_sk INT,
    sr_item_sk INT,
    ...
    sr_net_loss DECIMAL(7,2)
) ENGINE=HIVE
COMMENT 'HIVE'
PROPERTIES(
    'hive.metastore.uris' = 'thrift://xyz.com:9085',
    'database' = 'tpcds_1000',
    'table' = 'store_returns'
);
The table in Apache Hive is partitioned by sr_return_date_sk field.

Importing directly from HDFS utilizes *259 instances on four B**Es*:
66 on first BE,
65 on second BE,
64 and 64 on third and fourth BEs.
Query:
INSERT INTO tpcds.store_returns
    SELECT NULL, *
    FROM HDFS(
        'uri' = 'hdfs://
lang33.ca.sbrf.ru:8020/tmp/tpcds_1000/store_returns/sr_returned_date_sk=__HIVE_DEFAULT_PARTITION__/*
',
        'fs.defaultFS' = 'hdfs://lang33.ca.sbrf.ru:8020',
        'hadoop.username' = 'hdfs',
        'format' = 'parquet'
    );

Importing/querying Apache Hive utilizes *only 5 instances on four BEs*:
2 on first BE,
1 on second, third and fourth BEs.
Query:
INSERT INTO tpcds.store_returns
    SELECT
        sr_returned_date_sk,
        sr_customer_sk,
        sr_store_sk,
        sr_return_time_sk,
        sr_item_sk,
        sr_cdemo_sk,
        sr_hdemo_sk,
        sr_addr_sk,
        sr_reason_sk,
        sr_ticket_number,
        sr_return_quantity,
        sr_return_amt,
        sr_return_tax,
        sr_return_amt_inc_tax,
        sr_fee,
        sr_return_ship_cost,
        sr_refunded_cash,
        sr_reversed_charge,
        sr_store_credit,
        sr_net_loss
    FROM tpcds_10000.store_returns;

How to make Apache Doris utilize more instances for importing/querying
external Hive tables? Kindly ask for your advice about parameters/settings
values.

Kind regards,
Nikita

Reply via email to