Re: Re: Re: Merge Operation Failing Results in this SQL Error [40000] [42000]

Okumin Mon, 09 Sep 2024 04:57:11 -0700

Hi Clinton,

I'd like you to start a new thread about the problem with MV. It is
far from the original syntax problem, and MV is a fairly advanced
feature. A new thread with a proper title would attract more
appropriate people.


You can also present the minimal SQLs that reproduce the problem on
the thread. I tried to test MV + Iceberg on the official Docker image
but couldn't produce the error. I am guessing the file paths in the
error message would indicate you might not be using Iceberg.

To Lisoda,

I believe HIVE-28428 resolved your issue. That sounds great! If you
still have to struggle with the high compression ratio, you may try to
tune the stripe size of ORC, Tez grouping parameters, or the buffer
size of PipelinedSorter. I've not tested the query yet; this is just a
guess.
https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works

Regards,
Okumin

On Wed, Sep 4, 2024 at 10:21 PM lisoda <lis...@yeah.net> wrote:
>
> hello Okumin.
> After porting the patch HIVE-28428, I observed that the slow query problem 
> disappeared.Currently the query efficiency for orc+zstd table is basically 
> the same as orc+snappy.
> Therefore I strongly recommend addingHIVE-28428to version 4.0.1.
>
> Regards,
>
> lisoda
>
>
>
>
>
>
> At 2024-09-02 23:28:25, "Okumin" <m...@okumin.com> wrote:
> >Hi Clinton,
> >
> >Thanks for sharing your problem. If you provide more information, such
> >as a dataset or queries, we can reproduce it and file the problem.
> >
> >Hi Lisoda,
> >
> >Thanks for giving us real examples. Interesting. Can I understand the
> >first problem that happens when there is a big data file in an Iceberg
> >table? I could not reproduce it[1], and I am curious about the
> >detailed conditions. As for the map-side aggregation, we also found
> >and resolved a similar problem[2]. It could be good to check. The
> >other issues are also interesting. I'd file tickets if I had evidence.
> >
> >- [1] https://gist.github.com/okumin/4fccec45109fc9927a22f40c166fe7f9
> >- [2] https://issues.apache.org/jira/browse/HIVE-28428
> >
> >Regards,
> >Okumin
> >
> >On Mon, Sep 2, 2024 at 2:59 PM clinton chikwata <clintonfun...@gmail.com> 
> >wrote:
> >>
> >> Hello Lisoda,
> >>
> >> Thanks for this information.
> >>
> >>
> >> On Sun, Sep 1, 2024 at 4:04 PM lisoda <lis...@yeah.net> wrote:
> >>>
> >>> Hello Clinton:
> >>>
> >>> We have actually encountered the same issue where, in many cases, 
> >>> querying Iceberg does not meet expected efficiency, falling short of 
> >>> regular ORC/Parquet tables in speed. Since the current 
> >>> HiveIcebergInputSplit does not support splits based on file size, reading 
> >>> can be slow when individual data files in Iceberg are excessively large. 
> >>> This issue necessitates improvements from community developers in future 
> >>> iterations. Additionally, if Iceberg tables employ zstd compression, the 
> >>> current handling via Hive's aircompress library, a Java library, is 
> >>> notably less efficient than JNI implementations. This might only improve 
> >>> after a reconstruction leveraging JDK-SIMD. Furthermore, we have analyzed 
> >>> execution latency using flame graphs and discovered potential issues with 
> >>> the implementation of VectorGroupByOperator$ProcessingModeHashAggregate, 
> >>> which exhibits exceedingly poor performance. Consequently, as of now, 
> >>> with Iceberg tables, we can temporarily address the issue by increasing 
> >>> the number of map-tasks and reducing the size of individual data files in 
> >>> the Iceberg table. We hope these issues can be resolved in subsequent 
> >>> iterative developments.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> 在 2024-08-28 14:41:03，"clinton chikwata" <clintonfun...@gmail.com> 写道：
> >>>
> >>> Thanks  Okumin.
> >>>
> >>> I am new to Hive and Tez  and I have struggled to deploy a 
> >>> high-performance Dockerized Hive setup. I followed the documentation for 
> >>> setting up a remote Metastore. I have a single node with 32 GB of RAM and 
> >>> 8 cores, but I have a dataset of about 2 GB (Iceberg table partitioned on 
> >>> one column). However, when I run select queries, the performance has not 
> >>> been as fast as expected. Could someone share some insights, especially 
> >>> regarding hive-site.xml and Tez custom configuration?
> >>>
> >>> Any help would be appreciated.
> >>>
> >>> On Sun, Aug 4, 2024 at 4:46 PM Okumin <m...@okumin.com> wrote:
> >>>>
> >>>> Hi Clinton,
> >>>>
> >>>> I tested MERGE INTO with minimal reproduction. I saw the same error.
> >>>>
> >>>> ```
> >>>> CREATE TABLE src (col1 INT, col2 INT);
> >>>> CREATE TABLE dst (id BIGINT DEFAULT SURROGATE_KEY(), col1 INT, col2
> >>>> INT, PRIMARY KEY (id) DISABLE NOVALIDATE) STORED BY ICEBERG;
> >>>>
> >>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
> >>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
> >>>> WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (s.col1, s.col2);
> >>>> ```
> >>>>
> >>>> The following query, which explicitly inserts `id`, succeeded on my
> >>>> machine. The default keyword is unlikely to work on INSERT on MERGE
> >>>> INTO. I've yet to investigate whether ANSI allows us to omit it.
> >>>>
> >>>> ```
> >>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
> >>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
> >>>> WHEN NOT MATCHED THEN INSERT (id, col1, col2) VALUES (SURROGATE_KEY(),
> >>>> s.col1, s.col2);
> >>>> ```
> >>>>
> >>>> As another point, the SURROGATE_KEY might not work as you expected. It
> >>>> doesn't generate globally unique ids on my attempts.
> >>>>
> >>>> Regards,
> >>>> Okumin
> >>>>
> >>>> On Wed, Jul 31, 2024 at 4:54 PM clinton chikwata
> >>>> <clintonfun...@gmail.com> wrote:
> >>>> >
> >>>> > Dear Team,
> >>>> >
> >>>> > Any help will be much appreciated.
> >>>> >
> >>>> > Error SQL Error [40000] [42000]: Error while compiling statement: 
> >>>> > FAILED: SemanticException Schema of both sides of union should match.
> >>>> >
> >>>> > I have an ETL workload that stores data into temp_table with the 
> >>>> > schema as shown below.
> >>>> >
> >>>> > CREATE EXTERNAL TABLE IF NOT EXISTS temp_table (
> >>>> >     VC_ITEM_CODE STRING,
> >>>> >     VC_SUB_GROUP STRING,
> >>>> >     VC_PRODUCT_NAME STRING,
> >>>> >     VC_PRODUCT_UNIT STRING,
> >>>> >     VC_GROUP_CODE STRING,
> >>>> >     DT_VAT_START TIMESTAMP,
> >>>> >     VC_BAND_CODE STRING,
> >>>> >     VC_SEMI_BAND_CODE STRING,
> >>>> >     VC_DIVISIONS STRING,
> >>>> >     NU_UNIT_FACTOR DECIMAL(30, 0),
> >>>> >     VC_DIVISION_SEG_CODE STRING,
> >>>> >     VC_COLOR_COMB STRING,
> >>>> >     DT_MOD_DATE TIMESTAMP,
> >>>> >     VC_INACTIVE_PRODUCT STRING,
> >>>> >     RN DECIMAL(10, 0),
> >>>> >     country STRING
> >>>> >     )
> >>>> > STORED AS PARQUET
> >>>> > LOCATION 'S{path}'
> >>>> >
> >>>> > Then i want to load it to the final table
> >>>> >
> >>>> > CREATE TABLE product_dimension (
> >>>> >    `ID` BIGINT DEFAULT SURROGATE_KEY(),
> >>>> >     VC_ITEM_CODE STRING,
> >>>> >     VC_SUB_GROUP STRING,
> >>>> >     VC_PRODUCT_NAME STRING,
> >>>> >     VC_PRODUCT_UNIT STRING,
> >>>> >     VC_GROUP_CODE STRING,
> >>>> >     DT_VAT_START TIMESTAMP,
> >>>> >     VC_BAND_CODE STRING,
> >>>> >     VC_SEMI_BAND_CODE STRING,
> >>>> >     VC_DIVISIONS STRING,
> >>>> >     NU_UNIT_FACTOR DECIMAL(30, 0),
> >>>> >     VC_DIVISION_SEG_CODE STRING,
> >>>> >     VC_COLOR_COMB STRING,
> >>>> >     DT_MOD_DATE TIMESTAMP,
> >>>> >     VC_INACTIVE_PRODUCT STRING,
> >>>> >     RN DECIMAL(10, 0),
> >>>> >     country STRING,
> >>>> >     PRIMARY KEY (ID) DISABLE NOVALIDATE)
> >>>> > STORED BY ICEBERG;
> >>>> >
> >>>> > When I attempt to perform a merge operation on column  vc_item_code i 
> >>>> > get the error as shown above :
> >>>> >
> >>>> > MERGE
> >>>> > INTO
> >>>> > product_dimension AS c
> >>>> > USING (
> >>>> > SELECT
> >>>> > *
> >>>> > FROM
> >>>> > temp_table) AS s ON  s.vc_item_code = c.vc_item_code
> >>>> > AND s.country = c.country
> >>>> > WHEN MATCHED THEN
> >>>> > UPDATE
> >>>> > SET
> >>>> > vc_item_code = s.vc_item_code,
> >>>> > vc_sub_group = s.vc_sub_group,
> >>>> > vc_product_name = s.vc_product_name,
> >>>> > vc_product_unit = s.vc_product_unit,
> >>>> > vc_group_code = s.vc_group_code,
> >>>> > dt_vat_start = s.dt_vat_start,
> >>>> > vc_band_code = s.vc_band_code,
> >>>> > vc_semi_band_code = s.vc_semi_band_code,
> >>>> > vc_divisions = s.vc_divisions,
> >>>> > nu_unit_factor = s.nu_unit_factor,
> >>>> > vc_division_seg_code = s.vc_division_seg_code,
> >>>> > vc_color_comb = s.vc_color_comb,
> >>>> > dt_mod_date = s.dt_mod_date,
> >>>> > vc_inactive_product = s.vc_inactive_product,
> >>>> > rn = s.rn,
> >>>> > country = s.country
> >>>> > WHEN NOT MATCHED THEN
> >>>> > INSERT
> >>>> > (
> >>>> >     vc_item_code,
> >>>> > vc_sub_group,
> >>>> > vc_product_name,
> >>>> > vc_product_unit,
> >>>> > vc_group_code,
> >>>> > dt_vat_start,
> >>>> > vc_band_code,
> >>>> > vc_semi_band_code,
> >>>> > vc_divisions,
> >>>> > nu_unit_factor,
> >>>> > vc_division_seg_code,
> >>>> > vc_color_comb,
> >>>> > dt_mod_date,
> >>>> > vc_inactive_product,
> >>>> > rn,
> >>>> > country
> >>>> >     )
> >>>> > VALUES (
> >>>> > s.vc_item_code,
> >>>> > s.vc_sub_group,
> >>>> > s.vc_product_name,
> >>>> > s.vc_product_unit,
> >>>> > s.vc_group_code,
> >>>> > s.dt_vat_start,
> >>>> > s.vc_band_code,
> >>>> > s.vc_semi_band_code,
> >>>> > s.vc_divisions,
> >>>> > s.nu_unit_factor,
> >>>> > s.vc_division_seg_code,
> >>>> > s.vc_color_comb,
> >>>> > s.dt_mod_date,
> >>>> > s.vc_inactive_product,
> >>>> > s.rn,
> >>>> > s.country
> >>>> > );
> >>>> >
> >>>> > Warm Regards

Re: Re: Re: Merge Operation Failing Results in this SQL Error [40000] [42000]

Reply via email to