[ https://issues.apache.org/jira/browse/HIVE-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223741#comment-14223741 ]
Hari Sankar Sivarama Subramaniyan commented on HIVE-7751:
---------------------------------------------------------

Revisiting this issue.

> Mapjoin set in a non-conditional task can fail in MR mode because of memory
> overhead issues
> ---------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7751
>                 URL: https://issues.apache.org/jira/browse/HIVE-7751
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Hari Sankar Sivarama Subramaniyan
>            Assignee: Hari Sankar Sivarama Subramaniyan
>
> select sum(ss_quantity) from store_sales join store on store.s_store_sk =
> store_sales.ss_store_sk join customer_demographics on
> customer_demographics.cd_demo_sk = store_sales.ss_cdemo_sk join
> customer_address on store_sales.ss_addr_sk = customer_address.ca_address_sk
> join date_dim on store_sales.ss_sold_date_sk = date_dim.d_date_sk where
> d_year = 2000 and ((cd_marital_status = 'M' and cd_education_status =
> 'Advanced Degree' and ss_sales_price between 100.00 and 150.00) or
> (cd_marital_status = 'M' and cd_education_status = 'Advanced Degree' and
> ss_sales_price between 50.00 and 100.00) or (cd_marital_status = 'M' and
> cd_education_status = 'Advanced Degree' and ss_sales_price between 150.00 and
> 200.00)) and ((ca_country = 'United States' and ca_state in ('TX', 'OH',
> 'TX') and ss_net_profit between 0 and 2000) or (ca_country = 'United States'
> and ca_state in ('OR', 'MN', 'KY') and ss_net_profit between 150 and 3000) or
> (ca_country = 'United States' and ca_state in ('VA', 'TX', 'MS') and
> ss_net_profit between 50 and 25000));
> The above query where the data is stored as orc format can fail because we
> convert the above join to a non-conditional task assuming that mapjoin would
> succeed at runtime. But at runtime, the query can fail due to memory overhead
> issues.
> The improvement to prevent such failures would be to use table
> statistics instead of calling ql.exec.Utilities.getTotalInputFileSize()
> inside the CommonJoinTaskDispatcher. This would make sure that we make better
> decisions for MR mode. Tez, on the other hand, would handle such scenarios
> better because it actually relies on table stats to get the data size.
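
Until the size estimate in MR mode is based on table statistics, one possible session-level workaround is to stop Hive from committing to an unconditional mapjoin for this query. The sketch below is an assumption on my part and is not taken from the issue itself; it only uses existing Hive settings, and the threshold value shown is purely illustrative.

```sql
-- Assumed workaround, not part of the original report.
-- Option 1: keep automatic mapjoin conversion, but make it a conditional
-- task again so a common (shuffle) join is available as a backup at runtime.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=false;

-- Option 2: keep the non-conditional conversion but lower the combined
-- small-table size threshold (in bytes). The on-disk size of ORC files can
-- be much smaller than the in-memory size, so a tighter limit leaves more
-- headroom. The value below is illustrative only.
-- SET hive.auto.convert.join.noconditionaltask.size=5000000;
```

With the conditional task restored, a mapjoin that fails to build its hash table falls back to the backup join rather than failing the query, at the cost of a slower plan in that case.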