Hello Okumin,

After porting the patch HIVE-28428, I observed that the slow-query problem
disappeared. Query efficiency for the ORC+ZSTD table is now basically the
same as for ORC+Snappy.

Therefore I strongly recommend adding HIVE-28428 to version 4.0.1.
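For reference, the comparison can be reproduced with two tables that differ only in compression codec (table and column names below are placeholders, not the ones from our workload):

```
CREATE TABLE t_zstd (id BIGINT, payload STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress'='ZSTD');

CREATE TABLE t_snappy (id BIGINT, payload STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Load identical data into both, then time the same scan-heavy query
-- against each, e.g. SELECT COUNT(*), MAX(payload) FROM t_zstd;
```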
Regards,
lisoda

At 2024-09-02 23:28:25, "Okumin" <m...@okumin.com> wrote:
>Hi Clinton,
>
>Thanks for sharing your problem. If you provide more information, such
>as a dataset or queries, we can reproduce it and file the problem.
>
>Hi Lisoda,
>
>Thanks for giving us real examples. Interesting. Can I understand the
>first problem that happens when there is a big data file in an Iceberg
>table? I could not reproduce it[1], and I am curious about the
>detailed conditions. As for the map-side aggregation, we also found
>and resolved a similar problem[2]. It could be good to check. The
>other issues are also interesting. I'd file tickets if I had evidence.
>
>- [1] https://gist.github.com/okumin/4fccec45109fc9927a22f40c166fe7f9
>- [2] https://issues.apache.org/jira/browse/HIVE-28428
>
>Regards,
>Okumin
>
>On Mon, Sep 2, 2024 at 2:59 PM clinton chikwata <clintonfun...@gmail.com> 
>wrote:
>>
>> Hello Lisoda,
>>
>> Thanks for this information.
>>
>>
>> On Sun, Sep 1, 2024 at 4:04 PM lisoda <lis...@yeah.net> wrote:
>>>
>>> Hello Clinton:
>>>
>>> We have encountered the same issue: in many cases, querying Iceberg does 
>>> not meet expected efficiency and falls short of regular ORC/Parquet tables 
>>> in speed.
>>>
>>> Since the current HiveIcebergInputSplit does not support splitting based on 
>>> file size, reading can be slow when individual data files in an Iceberg 
>>> table are excessively large. This will need improvements from community 
>>> developers in future iterations.
>>>
>>> Additionally, if Iceberg tables use zstd compression, Hive currently 
>>> handles it via the aircompressor library, a pure-Java implementation that 
>>> is notably less efficient than JNI implementations. This may only improve 
>>> after a rewrite leveraging JDK SIMD.
>>>
>>> Furthermore, we analyzed execution latency using flame graphs and found 
>>> potential issues in the implementation of 
>>> VectorGroupByOperator$ProcessingModeHashAggregate, which exhibits 
>>> exceedingly poor performance.
>>>
>>> Consequently, for now, we work around the issue with Iceberg tables by 
>>> increasing the number of map-tasks and reducing the size of individual 
>>> data files in the table. We hope these issues can be resolved in 
>>> subsequent iterations.
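>>> A minimal sketch of that workaround (assuming the table is written through 
>>> Hive and that the standard Iceberg write property applies; the table name 
>>> is a placeholder and the 128 MB value is an assumption to tune):
>>>
>>> ```
>>> -- Cap the size of newly written data files so reads produce more
>>> -- splits and therefore more map-tasks (134217728 bytes = 128 MB).
>>> ALTER TABLE my_iceberg_table
>>> SET TBLPROPERTIES ('write.target-file-size-bytes'='134217728');
>>> ```
>>>
>>> Existing oversized files would still need to be rewritten (e.g. by 
>>> compaction) before reads benefit.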
>>>
>>> On 2024-08-28 14:41:03, "clinton chikwata" <clintonfun...@gmail.com> wrote:
>>>
>>> Thanks, Okumin.
>>>
>>> I am new to Hive and Tez, and I have struggled to deploy a high-performance 
>>> Dockerized Hive setup. I followed the documentation for setting up a remote 
>>> Metastore. I have a single node with 32 GB of RAM and 8 cores, and a 
>>> dataset of about 2 GB (an Iceberg table partitioned on one column). 
>>> However, when I run SELECT queries, performance has not been as fast as 
>>> expected. Could someone share some insights, especially regarding 
>>> hive-site.xml and custom Tez configuration?
>>>
>>> Any help would be appreciated.
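>>> For a single 32 GB / 8-core node, a starting point for hive-site.xml might 
>>> look like the following; the memory values are assumptions to tune for 
>>> your workload, not recommendations from this thread:
>>>
>>> ```
>>> <property>
>>>   <name>hive.execution.engine</name>
>>>   <value>tez</value>
>>> </property>
>>> <property>
>>>   <!-- per-task container size in MB; leave headroom for the OS -->
>>>   <name>hive.tez.container.size</name>
>>>   <value>4096</value>
>>> </property>
>>> <property>
>>>   <name>tez.am.resource.memory.mb</name>
>>>   <value>2048</value>
>>> </property>
>>> <property>
>>>   <name>hive.vectorized.execution.enabled</name>
>>>   <value>true</value>
>>> </property>
>>> ```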
>>>
>>> On Sun, Aug 4, 2024 at 4:46 PM Okumin <m...@okumin.com> wrote:
>>>>
>>>> Hi Clinton,
>>>>
>>>> I tested MERGE INTO with minimal reproduction. I saw the same error.
>>>>
>>>> ```
>>>> CREATE TABLE src (col1 INT, col2 INT);
>>>> CREATE TABLE dst (id BIGINT DEFAULT SURROGATE_KEY(), col1 INT, col2
>>>> INT, PRIMARY KEY (id) DISABLE NOVALIDATE) STORED BY ICEBERG;
>>>>
>>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>>>> WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (s.col1, s.col2);
>>>> ```
>>>>
>>>> The following query, which explicitly inserts `id`, succeeded on my
>>>> machine. The DEFAULT clause does not seem to be applied to an INSERT
>>>> inside MERGE INTO. I've yet to investigate whether ANSI allows us to
>>>> omit it.
>>>>
>>>> ```
>>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>>>> WHEN NOT MATCHED THEN INSERT (id, col1, col2) VALUES (SURROGATE_KEY(),
>>>> s.col1, s.col2);
>>>> ```
>>>>
>>>> As another point, SURROGATE_KEY might not work as you expect: in my
>>>> attempts it did not generate globally unique ids.
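>>>> A quick way to verify this on your own table (sketch, using the `dst`
>>>> table from the example above):
>>>>
>>>> ```
>>>> -- Returns any ids that SURROGATE_KEY() produced more than once.
>>>> SELECT id, COUNT(*) AS cnt
>>>> FROM dst
>>>> GROUP BY id
>>>> HAVING COUNT(*) > 1;
>>>> ```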
>>>>
>>>> Regards,
>>>> Okumin
>>>>
>>>> On Wed, Jul 31, 2024 at 4:54 PM clinton chikwata
>>>> <clintonfun...@gmail.com> wrote:
>>>> >
>>>> > Dear Team,
>>>> >
>>>> > Any help will be much appreciated.
>>>> >
>>>> > Error SQL Error [40000] [42000]: Error while compiling statement: 
>>>> > FAILED: SemanticException Schema of both sides of union should match.
>>>> >
>>>> > I have an ETL workload that stores data into temp_table with the schema 
>>>> > as shown below.
>>>> >
>>>> > CREATE EXTERNAL TABLE IF NOT EXISTS temp_table (
>>>> >     VC_ITEM_CODE STRING,
>>>> >     VC_SUB_GROUP STRING,
>>>> >     VC_PRODUCT_NAME STRING,
>>>> >     VC_PRODUCT_UNIT STRING,
>>>> >     VC_GROUP_CODE STRING,
>>>> >     DT_VAT_START TIMESTAMP,
>>>> >     VC_BAND_CODE STRING,
>>>> >     VC_SEMI_BAND_CODE STRING,
>>>> >     VC_DIVISIONS STRING,
>>>> >     NU_UNIT_FACTOR DECIMAL(30, 0),
>>>> >     VC_DIVISION_SEG_CODE STRING,
>>>> >     VC_COLOR_COMB STRING,
>>>> >     DT_MOD_DATE TIMESTAMP,
>>>> >     VC_INACTIVE_PRODUCT STRING,
>>>> >     RN DECIMAL(10, 0),
>>>> >     country STRING
>>>> >     )
>>>> > STORED AS PARQUET
>>>> > LOCATION '${path}'
>>>> >
>>>> > Then i want to load it to the final table
>>>> >
>>>> > CREATE TABLE product_dimension (
>>>> >    `ID` BIGINT DEFAULT SURROGATE_KEY(),
>>>> >     VC_ITEM_CODE STRING,
>>>> >     VC_SUB_GROUP STRING,
>>>> >     VC_PRODUCT_NAME STRING,
>>>> >     VC_PRODUCT_UNIT STRING,
>>>> >     VC_GROUP_CODE STRING,
>>>> >     DT_VAT_START TIMESTAMP,
>>>> >     VC_BAND_CODE STRING,
>>>> >     VC_SEMI_BAND_CODE STRING,
>>>> >     VC_DIVISIONS STRING,
>>>> >     NU_UNIT_FACTOR DECIMAL(30, 0),
>>>> >     VC_DIVISION_SEG_CODE STRING,
>>>> >     VC_COLOR_COMB STRING,
>>>> >     DT_MOD_DATE TIMESTAMP,
>>>> >     VC_INACTIVE_PRODUCT STRING,
>>>> >     RN DECIMAL(10, 0),
>>>> >     country STRING,
>>>> >     PRIMARY KEY (ID) DISABLE NOVALIDATE)
>>>> > STORED BY ICEBERG;
>>>> >
>>>> > When I attempt a merge operation on column vc_item_code, I get the
>>>> > error shown above:
>>>> >
>>>> > MERGE INTO product_dimension AS c
>>>> > USING (SELECT * FROM temp_table) AS s
>>>> >     ON s.vc_item_code = c.vc_item_code
>>>> >     AND s.country = c.country
>>>> > WHEN MATCHED THEN UPDATE SET
>>>> >     vc_item_code = s.vc_item_code,
>>>> >     vc_sub_group = s.vc_sub_group,
>>>> >     vc_product_name = s.vc_product_name,
>>>> >     vc_product_unit = s.vc_product_unit,
>>>> >     vc_group_code = s.vc_group_code,
>>>> >     dt_vat_start = s.dt_vat_start,
>>>> >     vc_band_code = s.vc_band_code,
>>>> >     vc_semi_band_code = s.vc_semi_band_code,
>>>> >     vc_divisions = s.vc_divisions,
>>>> >     nu_unit_factor = s.nu_unit_factor,
>>>> >     vc_division_seg_code = s.vc_division_seg_code,
>>>> >     vc_color_comb = s.vc_color_comb,
>>>> >     dt_mod_date = s.dt_mod_date,
>>>> >     vc_inactive_product = s.vc_inactive_product,
>>>> >     rn = s.rn,
>>>> >     country = s.country
>>>> > WHEN NOT MATCHED THEN INSERT (
>>>> >     vc_item_code, vc_sub_group, vc_product_name, vc_product_unit,
>>>> >     vc_group_code, dt_vat_start, vc_band_code, vc_semi_band_code,
>>>> >     vc_divisions, nu_unit_factor, vc_division_seg_code, vc_color_comb,
>>>> >     dt_mod_date, vc_inactive_product, rn, country
>>>> > ) VALUES (
>>>> >     s.vc_item_code, s.vc_sub_group, s.vc_product_name, s.vc_product_unit,
>>>> >     s.vc_group_code, s.dt_vat_start, s.vc_band_code, s.vc_semi_band_code,
>>>> >     s.vc_divisions, s.nu_unit_factor, s.vc_division_seg_code, s.vc_color_comb,
>>>> >     s.dt_mod_date, s.vc_inactive_product, s.rn, s.country
>>>> > );
>>>> >
>>>> > Warm Regards
