Hello Lisoda,

Thanks for this information.


On Sun, Sep 1, 2024 at 4:04 PM lisoda <lis...@yeah.net> wrote:

> Hello Clinton:
>
> We have actually encountered the same issue where, in many cases, querying
> Iceberg does not meet expected efficiency, falling short of regular
> ORC/Parquet tables in speed. Since the current HiveIcebergInputSplit does
> not support splits based on file size, reading can be slow when individual
> data files in Iceberg are excessively large. This issue necessitates
> improvements from community developers in future iterations. Additionally,
> if Iceberg tables employ zstd compression, the current handling via Hive's
> aircompress library, a Java library, is notably less efficient than JNI
> implementations. This might only improve after a reconstruction leveraging
> JDK-SIMD. Furthermore, we have analyzed execution latency using flame
> graphs and discovered potential issues with the implementation of
> VectorGroupByOperator$ProcessingModeHashAggregate, which exhibits
> exceedingly poor performance. Consequently, as of now, with Iceberg tables,
> we can temporarily address the issue by increasing the number of map-tasks
> and reducing the size of individual data files in the Iceberg table. We
> hope these issues can be resolved in subsequent iterative developments.
>
>
>
>
>
> 在 2024-08-28 14:41:03,"clinton chikwata" <clintonfun...@gmail.com> 写道:
>
> Thanks  Okumin.
>
> I am new to Hive and Tez  and I have struggled to deploy a
> high-performance Dockerized Hive setup. I followed the documentation for
> setting up a remote Metastore. I have a single node with 32 GB of RAM and 8
> cores, but I have a dataset of about 2 GB (Iceberg table partitioned on one
> column). However, when I run select queries, the performance has not been
> as fast as expected. Could someone share some insights, especially
> regarding hive-site.xml and Tez custom configuration?
>
> Any help would be appreciated.
>
> On Sun, Aug 4, 2024 at 4:46 PM Okumin <m...@okumin.com> wrote:
>
>> Hi Clinton,
>>
>> I tested MERGE INTO with minimal reproduction. I saw the same error.
>>
>> ```
>> CREATE TABLE src (col1 INT, col2 INT);
>> CREATE TABLE dst (id BIGINT DEFAULT SURROGATE_KEY(), col1 INT, col2
>> INT, PRIMARY KEY (id) DISABLE NOVALIDATE) STORED BY ICEBERG;
>>
>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>> WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (s.col1, s.col2);
>> ```
>>
>> The following query, which explicitly inserts `id`, succeeded on my
>> machine. The default keyword is unlikely to work on INSERT on MERGE
>> INTO. I've yet to investigate whether ANSI allows us to omit it.
>>
>> ```
>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>> WHEN NOT MATCHED THEN INSERT (id, col1, col2) VALUES (SURROGATE_KEY(),
>> s.col1, s.col2);
>> ```
>>
>> As another point, the SURROGATE_KEY might not work as you expected. It
>> doesn't generate globally unique ids on my attempts.
>>
>> Regards,
>> Okumin
>>
>> On Wed, Jul 31, 2024 at 4:54 PM clinton chikwata
>> <clintonfun...@gmail.com> wrote:
>> >
>> > Dear Team,
>> >
>> > Any help will be much appreciated.
>> >
>> > Error SQL Error [40000] [42000]: Error while compiling statement:
>> FAILED: SemanticException Schema of both sides of union should match.
>> >
>> > I have an ETL workload that stores data into temp_table with the schema
>> as shown below.
>> >
>> > CREATE EXTERNAL TABLE IF NOT EXISTS temp_table (
>> >     VC_ITEM_CODE STRING,
>> >     VC_SUB_GROUP STRING,
>> >     VC_PRODUCT_NAME STRING,
>> >     VC_PRODUCT_UNIT STRING,
>> >     VC_GROUP_CODE STRING,
>> >     DT_VAT_START TIMESTAMP,
>> >     VC_BAND_CODE STRING,
>> >     VC_SEMI_BAND_CODE STRING,
>> >     VC_DIVISIONS STRING,
>> >     NU_UNIT_FACTOR DECIMAL(30, 0),
>> >     VC_DIVISION_SEG_CODE STRING,
>> >     VC_COLOR_COMB STRING,
>> >     DT_MOD_DATE TIMESTAMP,
>> >     VC_INACTIVE_PRODUCT STRING,
>> >     RN DECIMAL(10, 0),
>> >     country STRING
>> >     )
>> > STORED AS PARQUET
>> > LOCATION 'S{path}'
>> >
>> > Then i want to load it to the final table
>> >
>> > CREATE TABLE product_dimension (
>> >    `ID` BIGINT DEFAULT SURROGATE_KEY(),
>> >     VC_ITEM_CODE STRING,
>> >     VC_SUB_GROUP STRING,
>> >     VC_PRODUCT_NAME STRING,
>> >     VC_PRODUCT_UNIT STRING,
>> >     VC_GROUP_CODE STRING,
>> >     DT_VAT_START TIMESTAMP,
>> >     VC_BAND_CODE STRING,
>> >     VC_SEMI_BAND_CODE STRING,
>> >     VC_DIVISIONS STRING,
>> >     NU_UNIT_FACTOR DECIMAL(30, 0),
>> >     VC_DIVISION_SEG_CODE STRING,
>> >     VC_COLOR_COMB STRING,
>> >     DT_MOD_DATE TIMESTAMP,
>> >     VC_INACTIVE_PRODUCT STRING,
>> >     RN DECIMAL(10, 0),
>> >     country STRING,
>> >     PRIMARY KEY (ID) DISABLE NOVALIDATE)
>> > STORED BY ICEBERG;
>> >
>> > When I attempt to perform a merge operation on column  vc_item_code i
>> get the error as shown above :
>> >
>> > MERGE
>> > INTO
>> > product_dimension AS c
>> > USING (
>> > SELECT
>> > *
>> > FROM
>> > temp_table) AS s ON  s.vc_item_code = c.vc_item_code
>> > AND s.country = c.country
>> > WHEN MATCHED THEN
>> > UPDATE
>> > SET
>> > vc_item_code = s.vc_item_code,
>> > vc_sub_group = s.vc_sub_group,
>> > vc_product_name = s.vc_product_name,
>> > vc_product_unit = s.vc_product_unit,
>> > vc_group_code = s.vc_group_code,
>> > dt_vat_start = s.dt_vat_start,
>> > vc_band_code = s.vc_band_code,
>> > vc_semi_band_code = s.vc_semi_band_code,
>> > vc_divisions = s.vc_divisions,
>> > nu_unit_factor = s.nu_unit_factor,
>> > vc_division_seg_code = s.vc_division_seg_code,
>> > vc_color_comb = s.vc_color_comb,
>> > dt_mod_date = s.dt_mod_date,
>> > vc_inactive_product = s.vc_inactive_product,
>> > rn = s.rn,
>> > country = s.country
>> > WHEN NOT MATCHED THEN
>> > INSERT
>> > (
>> >     vc_item_code,
>> > vc_sub_group,
>> > vc_product_name,
>> > vc_product_unit,
>> > vc_group_code,
>> > dt_vat_start,
>> > vc_band_code,
>> > vc_semi_band_code,
>> > vc_divisions,
>> > nu_unit_factor,
>> > vc_division_seg_code,
>> > vc_color_comb,
>> > dt_mod_date,
>> > vc_inactive_product,
>> > rn,
>> > country
>> >     )
>> > VALUES (
>> > s.vc_item_code,
>> > s.vc_sub_group,
>> > s.vc_product_name,
>> > s.vc_product_unit,
>> > s.vc_group_code,
>> > s.dt_vat_start,
>> > s.vc_band_code,
>> > s.vc_semi_band_code,
>> > s.vc_divisions,
>> > s.nu_unit_factor,
>> > s.vc_division_seg_code,
>> > s.vc_color_comb,
>> > s.dt_mod_date,
>> > s.vc_inactive_product,
>> > s.rn,
>> > s.country
>> > );
>> >
>> > Warm Regards
>>
>

Reply via email to