Hello Okumin. After porting the HIVE-28428 patch, I observed that the slow-query problem disappeared. Query efficiency for an orc+zstd table is now basically the same as for orc+snappy. I therefore strongly recommend adding HIVE-28428 to version 4.0.1.

Regards,
lisoda
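For readers who want to reproduce the zstd-versus-snappy comparison above, a minimal sketch; the table and column names are hypothetical, `orc.compress` is the standard ORC table property, and ZSTD support assumes an ORC version that provides it (Hive 4.x bundles one):

```
-- Two hypothetical tables with identical schemas, differing only in codec.
CREATE TABLE events_zstd (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZSTD');

CREATE TABLE events_snappy (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- A scan-heavy query to compare wall-clock time on both tables.
SELECT COUNT(*), MAX(LENGTH(payload)) FROM events_zstd;
```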
At 2024-09-02 23:28:25, "Okumin" <m...@okumin.com> wrote:

>Hi Clinton,
>
>Thanks for sharing your problem. If you provide more information, such
>as a dataset or queries, we can reproduce it and file the problem.
>
>Hi Lisoda,
>
>Thanks for giving us real examples. Interesting. Can I understand the
>first problem as one that happens when there is a big data file in an
>Iceberg table? I could not reproduce it[1], and I am curious about the
>detailed conditions. As for the map-side aggregation, we also found
>and resolved a similar problem[2]. It could be good to check. The
>other issues are also interesting. I'd file tickets if I had evidence.
>
>- [1] https://gist.github.com/okumin/4fccec45109fc9927a22f40c166fe7f9
>- [2] https://issues.apache.org/jira/browse/HIVE-28428
>
>Regards,
>Okumin
>
>On Mon, Sep 2, 2024 at 2:59 PM clinton chikwata <clintonfun...@gmail.com> wrote:
>>
>> Hello Lisoda,
>>
>> Thanks for this information.
>>
>> On Sun, Sep 1, 2024 at 4:04 PM lisoda <lis...@yeah.net> wrote:
>>>
>>> Hello Clinton:
>>>
>>> We have actually encountered the same issue: in many cases, querying
>>> Iceberg does not meet expected efficiency, falling short of regular
>>> ORC/Parquet tables in speed. Since the current HiveIcebergInputSplit
>>> does not support splitting based on file size, reading can be slow when
>>> individual data files in an Iceberg table are excessively large. This
>>> needs improvement from community developers in future iterations.
>>> Additionally, if Iceberg tables use zstd compression, Hive currently
>>> handles it via the aircompressor library, a pure-Java implementation
>>> that is notably less efficient than JNI implementations. This might
>>> only improve after a rewrite that leverages JDK SIMD. Furthermore, we
>>> have analyzed execution latency using flame graphs and discovered
>>> potential issues in VectorGroupByOperator$ProcessingModeHashAggregate,
>>> which exhibits exceedingly poor performance. Consequently, for now, we
>>> can temporarily work around the issue by increasing the number of map
>>> tasks and reducing the size of individual data files in the Iceberg
>>> table. We hope these issues can be resolved in subsequent development
>>> iterations.
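A minimal sketch of the workaround lisoda describes, assuming the standard Iceberg table property `write.target-file-size-bytes` and the Tez split-grouping setting `tez.grouping.max-size`; the table name and the 128 MB values are illustrative:

```
-- Make newly written data files smaller so more map tasks can scan them in
-- parallel (write.target-file-size-bytes is a standard Iceberg table property).
ALTER TABLE iceberg_tbl SET TBLPROPERTIES ('write.target-file-size-bytes' = '134217728');

-- Lowering the maximum Tez split-group size also increases the number of map tasks.
SET tez.grouping.max-size=134217728;
```

Note that the property only affects newly written files; existing oversized data files are unaffected until they are rewritten.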
>>>
>>> On 2024-08-28 14:41:03, "clinton chikwata" <clintonfun...@gmail.com> wrote:
>>>
>>> Thanks Okumin.
>>>
>>> I am new to Hive and Tez, and I have struggled to deploy a
>>> high-performance Dockerized Hive setup. I followed the documentation for
>>> setting up a remote Metastore. I have a single node with 32 GB of RAM
>>> and 8 cores, and a dataset of about 2 GB (an Iceberg table partitioned
>>> on one column). However, when I run SELECT queries, the performance has
>>> not been as fast as expected. Could someone share some insights,
>>> especially regarding hive-site.xml and custom Tez configuration?
>>>
>>> Any help would be appreciated.
>>>
>>> On Sun, Aug 4, 2024 at 4:46 PM Okumin <m...@okumin.com> wrote:
>>>>
>>>> Hi Clinton,
>>>>
>>>> I tested MERGE INTO with minimal reproduction. I saw the same error.
>>>>
>>>> ```
>>>> CREATE TABLE src (col1 INT, col2 INT);
>>>> CREATE TABLE dst (id BIGINT DEFAULT SURROGATE_KEY(), col1 INT,
>>>>                   col2 INT, PRIMARY KEY (id) DISABLE NOVALIDATE)
>>>> STORED BY ICEBERG;
>>>>
>>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>>>> WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (s.col1, s.col2);
>>>> ```
>>>>
>>>> The following query, which explicitly inserts `id`, succeeded on my
>>>> machine. The DEFAULT keyword is unlikely to work on the INSERT branch
>>>> of MERGE INTO. I've yet to investigate whether ANSI allows us to omit it.
>>>>
>>>> ```
>>>> MERGE INTO dst d USING src s ON s.col1 = d.col1
>>>> WHEN MATCHED THEN UPDATE SET col2 = s.col2
>>>> WHEN NOT MATCHED THEN INSERT (id, col1, col2)
>>>>   VALUES (SURROGATE_KEY(), s.col1, s.col2);
>>>> ```
>>>>
>>>> As another point, SURROGATE_KEY might not work as you expect: it
>>>> doesn't generate globally unique ids in my attempts.
>>>>
>>>> Regards,
>>>> Okumin
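A quick way to check the uniqueness caveat Okumin raises, assuming the `dst` table from the reproduction above:

```
-- Any rows returned indicate that SURROGATE_KEY() assigned the same id twice.
SELECT id, COUNT(*) AS cnt
FROM dst
GROUP BY id
HAVING COUNT(*) > 1;
```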
>>>>
>>>> On Wed, Jul 31, 2024 at 4:54 PM clinton chikwata
>>>> <clintonfun...@gmail.com> wrote:
>>>> >
>>>> > Dear Team,
>>>> >
>>>> > Any help will be much appreciated.
>>>> >
>>>> > Error: SQL Error [40000] [42000]: Error while compiling statement:
>>>> > FAILED: SemanticException Schema of both sides of union should match.
>>>> >
>>>> > I have an ETL workload that stores data into temp_table with the
>>>> > schema shown below.
>>>> >
>>>> > CREATE EXTERNAL TABLE IF NOT EXISTS temp_table (
>>>> >   VC_ITEM_CODE STRING,
>>>> >   VC_SUB_GROUP STRING,
>>>> >   VC_PRODUCT_NAME STRING,
>>>> >   VC_PRODUCT_UNIT STRING,
>>>> >   VC_GROUP_CODE STRING,
>>>> >   DT_VAT_START TIMESTAMP,
>>>> >   VC_BAND_CODE STRING,
>>>> >   VC_SEMI_BAND_CODE STRING,
>>>> >   VC_DIVISIONS STRING,
>>>> >   NU_UNIT_FACTOR DECIMAL(30, 0),
>>>> >   VC_DIVISION_SEG_CODE STRING,
>>>> >   VC_COLOR_COMB STRING,
>>>> >   DT_MOD_DATE TIMESTAMP,
>>>> >   VC_INACTIVE_PRODUCT STRING,
>>>> >   RN DECIMAL(10, 0),
>>>> >   country STRING
>>>> > )
>>>> > STORED AS PARQUET
>>>> > LOCATION 'S{path}'
>>>> >
>>>> > Then I want to load it into the final table:
>>>> >
>>>> > CREATE TABLE product_dimension (
>>>> >   `ID` BIGINT DEFAULT SURROGATE_KEY(),
>>>> >   VC_ITEM_CODE STRING,
>>>> >   VC_SUB_GROUP STRING,
>>>> >   VC_PRODUCT_NAME STRING,
>>>> >   VC_PRODUCT_UNIT STRING,
>>>> >   VC_GROUP_CODE STRING,
>>>> >   DT_VAT_START TIMESTAMP,
>>>> >   VC_BAND_CODE STRING,
>>>> >   VC_SEMI_BAND_CODE STRING,
>>>> >   VC_DIVISIONS STRING,
>>>> >   NU_UNIT_FACTOR DECIMAL(30, 0),
>>>> >   VC_DIVISION_SEG_CODE STRING,
>>>> >   VC_COLOR_COMB STRING,
>>>> >   DT_MOD_DATE TIMESTAMP,
>>>> >   VC_INACTIVE_PRODUCT STRING,
>>>> >   RN DECIMAL(10, 0),
>>>> >   country STRING,
>>>> >   PRIMARY KEY (ID) DISABLE NOVALIDATE)
>>>> > STORED BY ICEBERG;
>>>> >
>>>> > When I attempt to perform a merge operation on column vc_item_code,
>>>> > I get the error shown above:
>>>> >
>>>> > MERGE INTO product_dimension AS c
>>>> > USING (SELECT * FROM temp_table) AS s
>>>> >   ON s.vc_item_code = c.vc_item_code AND s.country = c.country
>>>> > WHEN MATCHED THEN UPDATE SET
>>>> >   vc_item_code = s.vc_item_code,
>>>> >   vc_sub_group = s.vc_sub_group,
>>>> >   vc_product_name = s.vc_product_name,
>>>> >   vc_product_unit = s.vc_product_unit,
>>>> >   vc_group_code = s.vc_group_code,
>>>> >   dt_vat_start = s.dt_vat_start,
>>>> >   vc_band_code = s.vc_band_code,
>>>> >   vc_semi_band_code = s.vc_semi_band_code,
>>>> >   vc_divisions = s.vc_divisions,
>>>> >   nu_unit_factor = s.nu_unit_factor,
>>>> >   vc_division_seg_code = s.vc_division_seg_code,
>>>> >   vc_color_comb = s.vc_color_comb,
>>>> >   dt_mod_date = s.dt_mod_date,
>>>> >   vc_inactive_product = s.vc_inactive_product,
>>>> >   rn = s.rn,
>>>> >   country = s.country
>>>> > WHEN NOT MATCHED THEN INSERT (
>>>> >   vc_item_code, vc_sub_group, vc_product_name, vc_product_unit,
>>>> >   vc_group_code, dt_vat_start, vc_band_code, vc_semi_band_code,
>>>> >   vc_divisions, nu_unit_factor, vc_division_seg_code, vc_color_comb,
>>>> >   dt_mod_date, vc_inactive_product, rn, country
>>>> > ) VALUES (
>>>> >   s.vc_item_code, s.vc_sub_group, s.vc_product_name, s.vc_product_unit,
>>>> >   s.vc_group_code, s.dt_vat_start, s.vc_band_code, s.vc_semi_band_code,
>>>> >   s.vc_divisions, s.nu_unit_factor, s.vc_division_seg_code,
>>>> >   s.vc_color_comb, s.dt_mod_date, s.vc_inactive_product, s.rn, s.country
>>>> > );
>>>> >
>>>> > Warm Regards
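Applying Okumin's workaround from earlier in the thread to this statement, a sketch that names `ID` explicitly in the INSERT branch and supplies `SURROGATE_KEY()` as its value; this mirrors the pattern that succeeded in Okumin's reproduction but is untested against this exact schema:

```
MERGE INTO product_dimension AS c
USING (SELECT * FROM temp_table) AS s
  ON s.vc_item_code = c.vc_item_code AND s.country = c.country
WHEN MATCHED THEN UPDATE SET
  vc_sub_group = s.vc_sub_group,
  vc_product_name = s.vc_product_name,
  vc_product_unit = s.vc_product_unit,
  vc_group_code = s.vc_group_code,
  dt_vat_start = s.dt_vat_start,
  vc_band_code = s.vc_band_code,
  vc_semi_band_code = s.vc_semi_band_code,
  vc_divisions = s.vc_divisions,
  nu_unit_factor = s.nu_unit_factor,
  vc_division_seg_code = s.vc_division_seg_code,
  vc_color_comb = s.vc_color_comb,
  dt_mod_date = s.dt_mod_date,
  vc_inactive_product = s.vc_inactive_product,
  rn = s.rn
WHEN NOT MATCHED THEN INSERT (
  ID, vc_item_code, vc_sub_group, vc_product_name, vc_product_unit,
  vc_group_code, dt_vat_start, vc_band_code, vc_semi_band_code,
  vc_divisions, nu_unit_factor, vc_division_seg_code, vc_color_comb,
  dt_mod_date, vc_inactive_product, rn, country
) VALUES (
  -- Surrogate key supplied explicitly, per Okumin's workaround.
  SURROGATE_KEY(), s.vc_item_code, s.vc_sub_group, s.vc_product_name,
  s.vc_product_unit, s.vc_group_code, s.dt_vat_start, s.vc_band_code,
  s.vc_semi_band_code, s.vc_divisions, s.nu_unit_factor,
  s.vc_division_seg_code, s.vc_color_comb, s.dt_mod_date,
  s.vc_inactive_product, s.rn, s.country
);
```

The join keys vc_item_code and country are omitted from the UPDATE SET here since they are equal on matched rows anyway; keeping them as in the original statement should be harmless.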