[ https://issues.apache.org/jira/browse/HIVE-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17924543#comment-17924543 ]
Paramvir Singh commented on HIVE-28735: --------------------------------------- My setup - 1 Primary - 32 vCore, 256 GiB memory, - 2 core nodes - 4 vCore, 16 GiB memory - Data : https://drive.google.com/drive/folders/14KEaeRRlYCVhHKqohuql7kO7_HxZwdqO?usp=drive_link Repro scenario DDLs {code:java} CREATE EXTERNAL TABLE `catalog_sales_small2`( `cs_sold_date_sk` int, `cs_sold_time_sk` int, `cs_ship_date_sk` int, `cs_bill_customer_sk` int, `cs_bill_cdemo_sk` int, `cs_bill_hdemo_sk` int, `cs_bill_addr_sk` int, `cs_ship_customer_sk` int, `cs_ship_cdemo_sk` int, `cs_ship_hdemo_sk` int, `cs_ship_addr_sk` int, `cs_call_center_sk` int, `cs_catalog_page_sk` int, `cs_ship_mode_sk` int, `cs_warehouse_sk` int, `cs_item_sk` int, `cs_promo_sk` int, `cs_order_number` bigint, `cs_quantity` int, `cs_wholesale_cost` decimal(7,2), `cs_list_price` decimal(7,2), `cs_sales_price` decimal(7,2), `cs_ext_discount_amt` decimal(7,2), `cs_ext_sales_price` decimal(7,2), `cs_ext_wholesale_cost` decimal(7,2), `cs_ext_list_price` decimal(7,2), `cs_ext_tax` decimal(7,2), `cs_coupon_amt` decimal(7,2), `cs_ext_ship_cost` decimal(7,2), `cs_net_paid` decimal(7,2), `cs_net_paid_inc_tax` decimal(7,2), `cs_net_paid_inc_ship` decimal(7,2), `cs_net_paid_inc_ship_tax` decimal(7,2), `cs_net_profit` decimal(7,2)) STORED AS ORC LOCATION 's3://prmsingh-hive/hive4/catalog_sales_small2/'; CREATE EXTERNAL TABLE `customer_small`( `c_customer_sk` int, `c_customer_id` string, `c_current_cdemo_sk` int, `c_current_hdemo_sk` int, `c_current_addr_sk` int, `c_first_shipto_date_sk` int, `c_first_sales_date_sk` int, `c_salutation` string, `c_first_name` string, `c_last_name` string, `c_preferred_cust_flag` string, `c_birth_day` int, `c_birth_month` int, `c_birth_year` int, `c_birth_country` string, `c_login` string, `c_email_address` string, `c_last_review_date` string) STORED AS ORC LOCATION 's3://prmsingh-hive/hive4/customer_small/'; CREATE EXTERNAL TABLE `customer_address_small`( `ca_address_sk` int, `ca_address_id` string, `ca_street_number` string, `ca_street_name` string, `ca_street_type` string, `ca_suite_number` string, `ca_city` string, `ca_county` string, `ca_state` string, `ca_zip` string, `ca_country` string, `ca_gmt_offset` decimal(5,2), `ca_location_type` string) STORED AS ORC LOCATION 's3://prmsingh-hive/hive4/customer_address_small/'; {code} Then run the following query {code:java} select ca_zip, count(*) from catalog_sales_small2, customer_small, customer_address_small where cs_bill_customer_sk = c_customer_sk and c_current_addr_sk = ca_address_sk group by ca_zip order by ca_zip limit 100; {code} Case 1: {code:java} hive.auto.convert.join.noconditionaltask.size=1073741824 hive.vectorized.execution.enabled=true hive.auto.convert.join=true hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true {code} Results are wrong and random on different runs Case 2 : When any of the above property is set to false. It'll give correct results which are as below. {code:java} +---------+--------+ | ca_zip | _c1 | +---------+--------+ | 00601 | 550 | | 00608 | 516 | | 00626 | 382 | | 00629 | 12 | | 00649 | 40 | | 00656 | 25 | | 00659 | 9811 | | 00668 | 14 | | 00669 | 10559 | | 00716 | 616 | | 00725 | 561 | | 00728 | 14 | | 00741 | 491 | | 00749 | 4 | | 00750 | 9976 | | 00756 | 4 | | 00762 | 9764 | | 00763 | 12 | | 00764 | 575 | | 00769 | 10538 | | 00791 | 10021 | | 00793 | 37 | | 00794 | 569 | | 00804 | 25 | | 00816 | 497 | | 00824 | 6 | | 00838 | 18 | | 00854 | 13 | | 00862 | 512 | | 00868 | 376 | | 00878 | 18 | | 00896 | 692 | | 00897 | 11 | | 00902 | 554 | | 00907 | 46 | | 00909 | 525 | | 00911 | 21 | | 00914 | 46 | | 00917 | 13 | | 00919 | 550 | | 00920 | 28 | | 00930 | 38 | | 00936 | 29 | | 00944 | 609 | | 00948 | 31 | | 00969 | 445 | | 00981 | 7 | | 00983 | 24 | | 00986 | 21 | | 00999 | 9840 | | 01008 | 361 | | 01011 | 11210 | | 01014 | 11 | | 01018 | 415 | | 01019 | 469 | | 01028 | 9 | | 01044 | 32 | | 01051 | 430 | | 01081 | 29 | | 01085 | 22 | | 01092 | 544 | | 01099 | 10428 | | 01114 | 13 | | 01119 | 461 | | 01122 | 8 | | 01125 | 535 | | 01126 | 25 | | 01134 | 10346 | | 01140 | 490 | | 01173 | 521 | | 01180 | 23 | | 01184 | 22 | | 01186 | 24 | | 01187 | 10502 | | 01189 | 369 | | 01218 | 625 | | 01231 | 389 | | 01234 | 572 | | 01235 | 467 | | 01236 | 41 | | 01247 | 48 | | 01262 | 17 | | 01263 | 8 | | 01279 | 18 | | 01281 | 38 | | 01289 | 9 | | 01302 | 19 | | 01325 | 466 | | 01328 | 7 | | 01332 | 501 | | 01344 | 510 | | 01357 | 4 | | 01365 | 14 | | 01366 | 496 | | 01417 | 18 | | 01451 | 14 | | 01452 | 541 | | 01454 | 555 | | 01458 | 77 | | 01470 | 27 | +---------+--------+ {code} > TPCDS queries q15, q19 are failing when > hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled is set to > true > ----------------------------------------------------------------------------------------------------------------------- > > Key: HIVE-28735 > URL: https://issues.apache.org/jira/browse/HIVE-28735 > Project: Hive > Issue Type: Sub-task > Components: Vectorization > Affects Versions: 4.0.0, 4.0.1 > Reporter: Paramvir Singh > Priority: Major > Labels: hive-4.1.0-must > > TPCDS queries q15, q19 are failing when > hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled is set to > true. > Setup should include atleast 2 node cluster. It's passing when the cluster > has only 1 node. > The wrong result is also random(on each run I get different random wrong > values). > Small repro query on TPCDS dataset > {code:java} > select ca_zip, count(*) > from catalog_sales_small, customer_small, customer_address_small > where cs_bill_customer_sk = c_customer_sk > and c_current_addr_sk = ca_address_sk > group by ca_zip > order by ca_zip > limit 100; > {code} > If we set the following properties, we get correct results > {code:java} > set hive.vectorized.execution.enabled=false; - Correct results > {code} > OR > {code:java} > set hive.auto.convert.join=false; - Correct results > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)