[ 
https://issues.apache.org/jira/browse/HIVE-23031?focusedWorklogId=429073&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-429073
 ]

ASF GitHub Bot logged work on HIVE-23031:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 30/Apr/20 15:07
            Start Date: 30/Apr/20 15:07
    Worklog Time Spent: 10m 
      Work Description: jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r418081401



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,19 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),
 
+    HIVE_OPTIMIZE_BI_ENABLED("hive.optimize.bi.enabled", false,
+        "Enables query rewrites based on approximate functions(sketches)."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.bi.rewrite.countdistinct.enabled",
+        true,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH(
+        "hive.optimize.bi.rewrite.countdistinct.sketch", "hll",
+        new StringSet("hll", "cpc", "theta"),

Review comment:
       I understand that it will work with a single algorithm. However, consider
the following scenario:
   - A user enables BI mode with the `hll` algorithm.
   - The user creates an MV with a count distinct. The MV stores the count
distinct field as an `hll` sketch, while the MV's SQL statement still has count
distinct.
   - We change the default algorithm to `cpc` and restart HS2. When HS2 loads
the MV, its count distinct is therefore rewritten to `cpc`.
   - The user runs a query with count distinct, which is rewritten to `cpc` and
matches the MV... but fails at deserialization time because the sketch stored
for the MV is `hll`.
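
The scenario above can be sketched as a toy model. This is only an illustration: the class, the method names, and the one-byte tags are hypothetical and are not Hive or DataSketches APIs. It assumes that each sketch family has its own binary format and that the rewrite picks the reader from the current configuration.

```java
// Toy model only: none of these names are Hive or DataSketches APIs.
class SketchMismatchDemo {

    // Made-up one-byte tag standing in for each sketch family's binary format.
    static byte tagOf(String family) {
        switch (family) {
            case "hll":   return 1;
            case "cpc":   return 2;
            case "theta": return 3;
            default:
                throw new IllegalArgumentException("unknown sketch family: " + family);
        }
    }

    // "Materializes" the count distinct column with the family configured at creation time.
    static byte[] buildMvSketch(String configuredFamily) {
        return new byte[] { tagOf(configuredFamily) };
    }

    // "Reads" the stored column with the family configured at query time.
    // The reader is chosen from configuration, not from anything recorded with the MV.
    static boolean canDeserialize(byte[] stored, String configuredFamily) {
        return stored.length > 0 && stored[0] == tagOf(configuredFamily);
    }

    public static void main(String[] args) {
        byte[] mv = buildMvSketch("hll");  // MV created while the default is hll
        System.out.println("read with hll: " + canDeserialize(mv, "hll"));
        System.out.println("read with cpc: " + canDeserialize(mv, "cpc"));
    }
}
```

In this model the second read is the failing case: the MV definition matches after the rewrite, but the stored bytes were produced by a different family.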
   
   That is why I suggested limiting the algorithm options until we have proper
support. The risk I see in not doing that now is that, if users create MVs
under different default algorithms, we will no longer have any way to tell
those MVs apart.
   
   Of the two choices that you mention above, I was suggesting the second,
since the main goal of the whole effort is to use these algorithms seamlessly
with MVs. I agree it can be outside the scope of this change, but could we
limit the algorithm choices until then?
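
One possible reading of "limit the algorithm choices" is to keep the option but accept a single value for now. The following is a minimal sketch under that assumption; `RestrictedSketchOption` and `validate` are hypothetical names, not the `StringSet` validator used in `HiveConf`, and the choice of `hll` as the only accepted value simply mirrors the default in the diff.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the option's validator; not the HiveConf StringSet class.
class RestrictedSketchOption {

    // Assumption for illustration: only hll is accepted for now.
    static final Set<String> ALLOWED = Collections.singleton("hll");

    // Returns the normalized value, or rejects anything outside the allowed set.
    static String validate(String value) {
        String normalized = value == null ? null : value.toLowerCase(Locale.ROOT);
        if (normalized == null || !ALLOWED.contains(normalized)) {
            throw new IllegalArgumentException(
                "unsupported sketch '" + value + "', expected one of " + ALLOWED);
        }
        return normalized;
    }
}
```

With a single accepted value, every MV created in the meantime is known to hold the same sketch family, so more values can be added later once MVs record which family they use.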






Issue Time Tracking
-------------------

    Worklog Id:     (was: 429073)
    Time Spent: 3h  (was: 2h 50m)

> Add option to enable transparent rewrite of count(distinct) into sketch 
> functions
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-23031
>                 URL: https://issues.apache.org/jira/browse/HIVE-23031
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>         Attachments: HIVE-23031.01.patch, HIVE-23031.02.patch, 
> HIVE-23031.03.patch, HIVE-23031.03.patch, HIVE-23031.03.patch, 
> HIVE-23031.04.patch, HIVE-23031.04.patch
>
>          Time Spent: 3h
>  Remaining Estimate: 0h
>




