Hello Lisoda, Thanks for this information.
On Sun, Sep 1, 2024 at 4:04 PM lisoda <lis...@yeah.net> wrote: > Hello Clinton: > > We have actually encountered the same issue where, in many cases, querying > Iceberg does not meet expected efficiency, falling short of regular > ORC/Parquet tables in speed. Since the current HiveIcebergInputSplit does > not support splits based on file size, reading can be slow when individual > data files in Iceberg are excessively large. This issue necessitates > improvements from community developers in future iterations. Additionally, > if Iceberg tables employ zstd compression, the current handling via Hive's > aircompress library, a Java library, is notably less efficient than JNI > implementations. This might only improve after a reconstruction leveraging > JDK-SIMD. Furthermore, we have analyzed execution latency using flame > graphs and discovered potential issues with the implementation of > VectorGroupByOperator$ProcessingModeHashAggregate, which exhibits > exceedingly poor performance. Consequently, as of now, with Iceberg tables, > we can temporarily address the issue by increasing the number of map-tasks > and reducing the size of individual data files in the Iceberg table. We > hope these issues can be resolved in subsequent iterative developments. > > > > > > 在 2024-08-28 14:41:03,"clinton chikwata" <clintonfun...@gmail.com> 写道: > > Thanks Okumin. > > I am new to Hive and Tez and I have struggled to deploy a > high-performance Dockerized Hive setup. I followed the documentation for > setting up a remote Metastore. I have a single node with 32 GB of RAM and 8 > cores, but I have a dataset of about 2 GB (Iceberg table partitioned on one > column). However, when I run select queries, the performance has not been > as fast as expected. Could someone share some insights, especially > regarding hive-site.xml and Tez custom configuration? > > Any help would be appreciated. > > On Sun, Aug 4, 2024 at 4:46 PM Okumin <m...@okumin.com> wrote: > >> Hi Clinton, >> >> I tested MERGE INTO with minimal reproduction. I saw the same error. >> >> ``` >> CREATE TABLE src (col1 INT, col2 INT); >> CREATE TABLE dst (id BIGINT DEFAULT SURROGATE_KEY(), col1 INT, col2 >> INT, PRIMARY KEY (id) DISABLE NOVALIDATE) STORED BY ICEBERG; >> >> MERGE INTO dst d USING src s ON s.col1 = d.col1 >> WHEN MATCHED THEN UPDATE SET col2 = s.col2 >> WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (s.col1, s.col2); >> ``` >> >> The following query, which explicitly inserts `id`, succeeded on my >> machine. The default keyword is unlikely to work on INSERT on MERGE >> INTO. I've yet to investigate whether ANSI allows us to omit it. >> >> ``` >> MERGE INTO dst d USING src s ON s.col1 = d.col1 >> WHEN MATCHED THEN UPDATE SET col2 = s.col2 >> WHEN NOT MATCHED THEN INSERT (id, col1, col2) VALUES (SURROGATE_KEY(), >> s.col1, s.col2); >> ``` >> >> As another point, the SURROGATE_KEY might not work as you expected. It >> doesn't generate globally unique ids on my attempts. >> >> Regards, >> Okumin >> >> On Wed, Jul 31, 2024 at 4:54 PM clinton chikwata >> <clintonfun...@gmail.com> wrote: >> > >> > Dear Team, >> > >> > Any help will be much appreciated. >> > >> > Error SQL Error [40000] [42000]: Error while compiling statement: >> FAILED: SemanticException Schema of both sides of union should match. >> > >> > I have an ETL workload that stores data into temp_table with the schema >> as shown below. >> > >> > CREATE EXTERNAL TABLE IF NOT EXISTS temp_table ( >> > VC_ITEM_CODE STRING, >> > VC_SUB_GROUP STRING, >> > VC_PRODUCT_NAME STRING, >> > VC_PRODUCT_UNIT STRING, >> > VC_GROUP_CODE STRING, >> > DT_VAT_START TIMESTAMP, >> > VC_BAND_CODE STRING, >> > VC_SEMI_BAND_CODE STRING, >> > VC_DIVISIONS STRING, >> > NU_UNIT_FACTOR DECIMAL(30, 0), >> > VC_DIVISION_SEG_CODE STRING, >> > VC_COLOR_COMB STRING, >> > DT_MOD_DATE TIMESTAMP, >> > VC_INACTIVE_PRODUCT STRING, >> > RN DECIMAL(10, 0), >> > country STRING >> > ) >> > STORED AS PARQUET >> > LOCATION 'S{path}' >> > >> > Then i want to load it to the final table >> > >> > CREATE TABLE product_dimension ( >> > `ID` BIGINT DEFAULT SURROGATE_KEY(), >> > VC_ITEM_CODE STRING, >> > VC_SUB_GROUP STRING, >> > VC_PRODUCT_NAME STRING, >> > VC_PRODUCT_UNIT STRING, >> > VC_GROUP_CODE STRING, >> > DT_VAT_START TIMESTAMP, >> > VC_BAND_CODE STRING, >> > VC_SEMI_BAND_CODE STRING, >> > VC_DIVISIONS STRING, >> > NU_UNIT_FACTOR DECIMAL(30, 0), >> > VC_DIVISION_SEG_CODE STRING, >> > VC_COLOR_COMB STRING, >> > DT_MOD_DATE TIMESTAMP, >> > VC_INACTIVE_PRODUCT STRING, >> > RN DECIMAL(10, 0), >> > country STRING, >> > PRIMARY KEY (ID) DISABLE NOVALIDATE) >> > STORED BY ICEBERG; >> > >> > When I attempt to perform a merge operation on column vc_item_code i >> get the error as shown above : >> > >> > MERGE >> > INTO >> > product_dimension AS c >> > USING ( >> > SELECT >> > * >> > FROM >> > temp_table) AS s ON s.vc_item_code = c.vc_item_code >> > AND s.country = c.country >> > WHEN MATCHED THEN >> > UPDATE >> > SET >> > vc_item_code = s.vc_item_code, >> > vc_sub_group = s.vc_sub_group, >> > vc_product_name = s.vc_product_name, >> > vc_product_unit = s.vc_product_unit, >> > vc_group_code = s.vc_group_code, >> > dt_vat_start = s.dt_vat_start, >> > vc_band_code = s.vc_band_code, >> > vc_semi_band_code = s.vc_semi_band_code, >> > vc_divisions = s.vc_divisions, >> > nu_unit_factor = s.nu_unit_factor, >> > vc_division_seg_code = s.vc_division_seg_code, >> > vc_color_comb = s.vc_color_comb, >> > dt_mod_date = s.dt_mod_date, >> > vc_inactive_product = s.vc_inactive_product, >> > rn = s.rn, >> > country = s.country >> > WHEN NOT MATCHED THEN >> > INSERT >> > ( >> > vc_item_code, >> > vc_sub_group, >> > vc_product_name, >> > vc_product_unit, >> > vc_group_code, >> > dt_vat_start, >> > vc_band_code, >> > vc_semi_band_code, >> > vc_divisions, >> > nu_unit_factor, >> > vc_division_seg_code, >> > vc_color_comb, >> > dt_mod_date, >> > vc_inactive_product, >> > rn, >> > country >> > ) >> > VALUES ( >> > s.vc_item_code, >> > s.vc_sub_group, >> > s.vc_product_name, >> > s.vc_product_unit, >> > s.vc_group_code, >> > s.dt_vat_start, >> > s.vc_band_code, >> > s.vc_semi_band_code, >> > s.vc_divisions, >> > s.nu_unit_factor, >> > s.vc_division_seg_code, >> > s.vc_color_comb, >> > s.dt_mod_date, >> > s.vc_inactive_product, >> > s.rn, >> > s.country >> > ); >> > >> > Warm Regards >> >