[ https://issues.apache.org/jira/browse/HIVE-23031?focusedWorklogId=429073&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-429073 ]
ASF GitHub Bot logged work on HIVE-23031:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 30/Apr/20 15:07
            Start Date: 30/Apr/20 15:07
    Worklog Time Spent: 10m
      Work Description:
jcamachor commented on a change in pull request #988:
URL: https://github.com/apache/hive/pull/988#discussion_r418081401

##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2465,6 +2465,19 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "If the number of references to a CTE clause exceeds this threshold, Hive will materialize it\n" +
         "before executing the main query block. -1 will disable this feature."),

+    HIVE_OPTIMIZE_BI_ENABLED("hive.optimize.bi.enabled", false,
+        "Enables query rewrites based on approximate functions(sketches)."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNTDISTINCT_ENABLED("hive.optimize.bi.rewrite.countdistinct.enabled",
+        true,
+        "Enables to rewrite COUNT(DISTINCT(X)) queries to be rewritten to use sketch functions."),
+
+    HIVE_OPTIMIZE_BI_REWRITE_COUNT_DISTINCT_SKETCH(
+        "hive.optimize.bi.rewrite.countdistinct.sketch", "hll",
+        new StringSet("hll", "cpc", "theta"),

Review comment:
I understand for a single algorithm it will work. However, consider the following scenario:
- A user enables BI mode and algorithm `hll`.
- The user creates a MV with count distinct. The MV has stored the count distinct field using `hll`. The SQL statement still has count distinct.
- We change default algorithm to `cpc` and restart HS2. Thus, when the MV is loaded by HS2, the count distinct is transformed to `cpc`.
- The user runs a query with count distinct, which transforms to `cpc`, matches the MV... but fails at deserialization time because the sketch stored for the MV is `hll`.

That is why I suggested we could limit the options for algorithms till we have proper support.
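The scenario above can be sketched as a small, self-contained Java model. This is an illustration only, not Hive code: the class `SketchMismatchDemo`, its methods, and the function name `sketch_estimate_<algorithm>` are all hypothetical, and the "sketch" is just a tagged byte array standing in for a serialized sketch.

```java
import java.nio.charset.StandardCharsets;

// Illustrative model only: none of these types or function names exist in Hive.
class SketchMismatchDemo {

    // Hypothetical stand-in for a materialized view. The stored SQL text still says
    // COUNT(DISTINCT x); the stored bytes were produced by one specific algorithm.
    // storedAlgorithm models what is implicit in the bytes, not recorded metadata.
    static final class MaterializedView {
        final String definitionSql;
        final String storedAlgorithm;
        final byte[] sketchBytes;

        MaterializedView(String definitionSql, String storedAlgorithm, byte[] sketchBytes) {
            this.definitionSql = definitionSql;
            this.storedAlgorithm = storedAlgorithm;
            this.sketchBytes = sketchBytes;
        }
    }

    // Hypothetical rewrite: swaps COUNT(DISTINCT x) for a made-up sketch call
    // named after the currently configured algorithm.
    static String rewrite(String sql, String configuredAlgorithm) {
        return sql.replace("COUNT(DISTINCT x)", "sketch_estimate_" + configuredAlgorithm + "(x)");
    }

    // Creating the MV materializes the data with the algorithm configured at that time.
    static MaterializedView createMv(String sql, String configuredAlgorithm) {
        byte[] bytes = ("sketch:" + configuredAlgorithm).getBytes(StandardCharsets.UTF_8);
        return new MaterializedView(sql, configuredAlgorithm, bytes);
    }

    // After a restart, both the MV definition and the query are rewritten with the
    // current setting, so they match regardless of how the data was written.
    static boolean matches(MaterializedView mv, String querySql, String configuredAlgorithm) {
        return rewrite(mv.definitionSql, configuredAlgorithm)
                .equals(rewrite(querySql, configuredAlgorithm));
    }

    // Reading fails when the configured algorithm differs from the one that wrote the bytes.
    static int read(MaterializedView mv, String configuredAlgorithm) {
        if (!mv.storedAlgorithm.equals(configuredAlgorithm)) {
            throw new IllegalStateException("cannot read a " + mv.storedAlgorithm
                    + " sketch as " + configuredAlgorithm);
        }
        return mv.sketchBytes.length;
    }

    public static void main(String[] args) {
        String sql = "SELECT COUNT(DISTINCT x) FROM t";
        MaterializedView mv = createMv(sql, "hll");   // MV created while hll is configured
        System.out.println("matches under cpc: " + matches(mv, sql, "cpc"));
        try {
            read(mv, "cpc");                          // default changed to cpc, HS2 restarted
        } catch (IllegalStateException e) {
            System.out.println("read failed: " + e.getMessage());
        }
    }
}
```

In this model the match succeeds because only the SQL text is compared, while the failure appears later, when the stored bytes are read. Allowing a single algorithm removes the mismatch by construction.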
The risk I see if we do not do that now is that if anyone creates MVs using the different default algorithms, we will not have any way to distinguish between them anymore.
From the two choices that you mention above, I was suggesting the second option, since the main goal of the whole effort is to be able to use these algorithms seamlessly with the MVs. I agree it can be outside of the scope of this change, but let's limit the algorithm choices till then?


Issue Time Tracking
-------------------

    Worklog Id:     (was: 429073)
    Time Spent: 3h  (was: 2h 50m)

> Add option to enable transparent rewrite of count(distinct) into sketch functions
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-23031
>                 URL: https://issues.apache.org/jira/browse/HIVE-23031
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Zoltan Haindrich
>            Assignee: Zoltan Haindrich
>            Priority: Major
>         Attachments: HIVE-23031.01.patch, HIVE-23031.02.patch, HIVE-23031.03.patch, HIVE-23031.03.patch, HIVE-23031.03.patch, HIVE-23031.04.patch, HIVE-23031.04.patch
>
>          Time Spent: 3h
>  Remaining Estimate: 0h