[jira] [Comment Edited] (HIVE-11133) Support hive.explain.user for Spark

Sahil Takiar (JIRA) Thu, 20 Apr 2017 18:30:41 -0700

    [ https://issues.apache.org/jira/browse/HIVE-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977912#comment-15977912 ]


Sahil Takiar edited comment on HIVE-11133 at 4/21/17 1:29 AM:
--------------------------------------------------------------

[~xuefuz], [~lirui]

The qtest in the patch has a very similar query:

{code}
select sum(hash(a.k1,a.v1,a.k2, a.v2))
from (
select src1.key as k1, src1.value as v1, 
       src2.key as k2, src2.value as v2 FROM 
  (select * FROM src WHERE src.key < 10) src1 
    JOIN 
  (select * FROM src WHERE src.key < 10) src2
  SORT BY k1, v1, k2, v2
) a
{code}

It's also a mapjoin. The user-level explain output is:

{code}
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT)
Reducer 3 <- Reducer 2 (GROUP)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_17]
        Group By Operator [GBY_15] (rows=1 width=8)
          Output:["_col0"],aggregations:["sum(VALUE._col0)"]
        <-Reducer 2 [GROUP]
          GROUP [RS_14]
            Group By Operator [GBY_13] (rows=1 width=8)
              Output:["_col0"],aggregations:["sum(hash(_col0,_col1,_col2,_col3))"]
              Select Operator [SEL_11] (rows=27556 width=22)
                Output:["_col0","_col1","_col2","_col3"]
              <-Map 1 [PARTITION-LEVEL SORT]
                PARTITION-LEVEL SORT [RS_10]
                  Map Join Operator [MAPJOIN_20] (rows=27556 width=22)
                    Conds:(Inner),Output:["_col0","_col1","_col2","_col3"]
                  <-Select Operator [SEL_2] (rows=166 width=10)
                      Output:["_col0","_col1"]
                      Filter Operator [FIL_18] (rows=166 width=10)
                        predicate:(key < 10)
                        TableScan [TS_0] (rows=500 width=10)
                          default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"]
                Map Reduce Local Work
        Stage-2
          Map 4
          keys: [HASHTABLESINK_22]
            Select Operator [SEL_5] (rows=166 width=10)
              Output:["_col0","_col1"]
              Filter Operator [FIL_19] (rows=166 width=10)
                predicate:(key < 10)
                TableScan [TS_3] (rows=500 width=10)
                  default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"]
          Map Reduce Local Work
{code}
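
To check the vertex edges mechanically rather than by eye, here is a minimal sketch that parses the "Vertex dependency" lines of the output above. The helper name {{parse_vertex_dependencies}} and both regular expressions are my own assumptions for illustration; they are not part of Hive or of the patch.

```python
import re

# Hypothetical helper, not part of Hive: parses lines such as
#   "Reducer 2 <- Map 1 (PARTITION-LEVEL SORT)"
# into (child, parent, edge_type) tuples.
EDGE_RE = re.compile(r"^(?P<child>\S.*?) <- (?P<parents>.+)$")
PARENT_RE = re.compile(r"(?P<name>[A-Za-z]+ \d+) \((?P<edge>[^)]+)\)")


def parse_vertex_dependencies(explain_text):
    edges = []
    for line in explain_text.splitlines():
        match = EDGE_RE.match(line.strip())
        if not match:
            # Operator-tree lines like "<-Map 1 [...]" have no " <- " and are skipped.
            continue
        for parent in PARENT_RE.finditer(match.group("parents")):
            edges.append(
                (match.group("child"), parent.group("name"), parent.group("edge"))
            )
    return edges


# Header lines copied from the user-level explain output above.
SAMPLE = """\
Vertex dependency in root stage
Reducer 2 <- Map 1 (PARTITION-LEVEL SORT)
Reducer 3 <- Reducer 2 (GROUP)
"""

edges = parse_vertex_dependencies(SAMPLE)
for child, parent, edge in edges:
    print(f"{parent} -> {child} [{edge}]")
```

This only captures the edges listed under "Vertex dependency in root stage". In the output above that list names Map 1, Reducer 2 and Reducer 3; Map 4 is printed under Stage-2 and does not appear in it.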

The raw query plan looks like:

{code}
{
  "STAGE DEPENDENCIES": {
    "Stage-2": {
      "ROOT STAGE": "TRUE"
    },
    "Stage-1": {
      "DEPENDENT STAGES": "Stage-2"
    },
    "Stage-0": {
      "DEPENDENT STAGES": "Stage-1"
    }
  },
  "STAGE PLANS": {
    "Stage-2": {
      "Spark": {
        "Vertices:": {
          "Map 2": {
            "Map Operator Tree:": [
              {
                "TableScan": {
                  "Output:": [
                    "key",
                    "value"
                  ],
                  "_empty_": "default@myinput1,b,Tbl:COMPLETE,Col:NONE",
                  "Statistics:": "rows=3 width=8",
                  "OperatorId:": "TS_1",
                  "children": {
                    "keys:": {
                      "0": "key",
                      "1": "value",
                      "OperatorId:": "HASHTABLESINK_10"
                    }
                  }
                }
              }
            ],
            "Local Work:": {
              "Map Reduce Local Work": {
                
              }
            },
            "tag:": "0"
          }
        }
      }
    },
    "Stage-1": {
      "Spark": {
        "Vertices:": {
          "Map 1": {
            "Map Operator Tree:": [
              {
                "TableScan": {
                  "Output:": [
                    "key",
                    "value"
                  ],
                  "_empty_": "default@myinput1,a,Tbl:COMPLETE,Col:NONE",
                  "Statistics:": "rows=3 width=8",
                  "OperatorId:": "TS_0",
                  "children": {
                    "Map Join Operator": {
                      "condition map:": [
                        {
                          "_empty_": "{\"type\":\"Inner\",\"left\":0,\"right\":1}"
                        }
                      ],
                      "input vertices:": {
                        "1": "Map 2"
                      },
                      "keys:": {
                        "0": "key",
                        "1": "value"
                      },
                      "Output:": [
                        "_col0",
                        "_col1",
                        "_col5",
                        "_col6"
                      ],
                      "Statistics:": "rows=3 width=9",
                      "OperatorId:": "MAPJOIN_7",
                      "children": {
                        "Select Operator": {
                          "Output:": [
                            "_col0",
                            "_col1",
                            "_col2",
                            "_col3"
                          ],
                          "Statistics:": "rows=3 width=9",
                          "OperatorId:": "SEL_8",
                          "children": {
                            "File Output Operator": {
                              "Statistics:": "rows=3 width=9",
                              "OperatorId:": "FS_6"
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            ],
            "Local Work:": {
              "Map Reduce Local Work": {
                
              }
            },
            "tag:": "0"
          }
        }
      }
    },
    "Stage-0": {
      "Fetch Operator": {
        "limit:": "-1"
      }
    }
  },
  "cboInfo": "Plan not optimized by CBO due to missing feature [Less_than_equal_greater_than]."
}
{code}
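
As a rough way to read the dependencies out of the raw plan, the sketch below walks a trimmed copy of the JSON above. It lists the stage-to-stage edges from "STAGE DEPENDENCIES" and the vertices that a map join names under "input vertices:". The function names are hypothetical, and the comma-splitting of "DEPENDENT STAGES" is an assumption about how multiple parents would be serialized; the plan above only ever has a single parent per stage.

```python
import json

# Trimmed excerpt of the raw plan above; only the keys used below are kept.
PLAN_JSON = """
{
  "STAGE DEPENDENCIES": {
    "Stage-2": {"ROOT STAGE": "TRUE"},
    "Stage-1": {"DEPENDENT STAGES": "Stage-2"},
    "Stage-0": {"DEPENDENT STAGES": "Stage-1"}
  },
  "STAGE PLANS": {
    "Stage-2": {"Spark": {"Vertices:": {"Map 2": {}}}},
    "Stage-1": {"Spark": {"Vertices:": {"Map 1": {
      "Map Operator Tree:": [{"TableScan": {"children": {
        "Map Join Operator": {"input vertices:": {"1": "Map 2"}}
      }}}]
    }}}},
    "Stage-0": {"Fetch Operator": {"limit:": "-1"}}
  }
}
"""


def stage_edges(plan):
    """Return (upstream_stage, downstream_stage) pairs."""
    edges = []
    for stage, info in plan["STAGE DEPENDENCIES"].items():
        deps = info.get("DEPENDENT STAGES")
        if deps:
            # Assumption: several parents, if present, are comma-separated.
            for parent in deps.split(","):
                edges.append((parent.strip(), stage))
    return edges


def find_input_vertices(node):
    """Recursively collect the values of every "input vertices:" mapping."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "input vertices:" and isinstance(value, dict):
                found.extend(value.values())
            else:
                found.extend(find_input_vertices(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_input_vertices(item))
    return found


def mapjoin_inputs(plan):
    """Map each Spark vertex to the vertices its map joins read from."""
    result = {}
    for body in plan["STAGE PLANS"].values():
        vertices = body.get("Spark", {}).get("Vertices:", {})
        for name, vertex in vertices.items():
            inputs = find_input_vertices(vertex)
            if inputs:
                result[name] = inputs
    return result


plan = json.loads(PLAN_JSON)
print(stage_edges(plan))
print(mapjoin_inputs(plan))
```

On this excerpt the map join in Map 1 names Map 2 as its input vertex, and Stage-1 depends on Stage-2, which matches the keys in the raw plan as pasted.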

So it looks like the map -> reduce dependency is present: Map 4 (the hash table 
sink operator) -> Reducer 2 (group by). Does that sound correct?


> Support hive.explain.user for Spark
> -----------------------------------
>
>                 Key: HIVE-11133
>                 URL: https://issues.apache.org/jira/browse/HIVE-11133
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Mohit Sabharwal
>            Assignee: Sahil Takiar
>         Attachments: HIVE-11133.1.patch, HIVE-11133.2.patch, 
> HIVE-11133.3.patch, HIVE-11133.4.patch, HIVE-11133.5.patch, 
> HIVE-11133.6.patch, HIVE-11133.7.patch
>
>
> User-friendly explain output ({{set hive.explain.user=true}}) should support 
> Spark as well. 
> Once supported, we should also enable related q-tests like {{explainuser_1.q}}.



