Yes, the serializer just doesn't work with negative counters. That explains why the result is inaccurate, and it's not a bug.
About the SUM auto-rewrite behavior, I have created a JIRA: https://issues.apache.org/jira/browse/KYLIN-2620

Thanks to Shi ShaoFeng, Li Yang and Billy Liu for helping resolve the questions.

________________________________
From: ShaoFeng Shi <[email protected]>
Sent: Monday, May 15, 2017 10:36 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

OK, then that explains it. The Top N algorithm is based on the assumption that your data follows a Zipf distribution (https://en.wikipedia.org/wiki/Zipf%27s_law). If it does not, the result can be very inaccurate. Your sample is such a case: with only one key and negative counters, the error accumulates and finally shows a variation from the real value, which I don't think is a bug. Please use a SUM measure instead of abusing TopN.

2017-05-15 16:55 GMT+08:00 Tingmao Lin <[email protected]>:

We found something about serialization precision that might be related to the inaccurate-result problem. TopNCounterSerializer uses DoubleDeltaSerializer to serialize the double array. DoubleDeltaSerializer writes each double value as int(d*1000+0.5)/1000, so on each serialization pass a negative integer value increases by 0.001.
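The rounding loss described above can be reproduced in isolation. The following is a standalone sketch that mimics the int(d*1000+0.5) fixed-point conversion; the class and method names are illustrative, and this is not the actual DoubleDeltaSerializer code:

```java
// Standalone sketch of the fixed-point conversion described above.
// (Illustrative only -- not the actual Kylin DoubleDeltaSerializer.)
public class DeltaRoundingDemo {
    // Scale to thousandths, round, and scale back: one serialize/deserialize pass.
    static double roundTrip(double d) {
        long scaled = (long) (d * 1000 + 0.5); // (long) cast truncates toward zero
        return scaled / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(1.0));  // positive integers survive: 1.0
        System.out.println(roundTrip(-1.0)); // negatives drift upward: -0.999
    }
}
```

The +0.5 offset rounds positive values correctly, but for -1.0 it yields -999.5, which the cast truncates toward zero to -999, i.e. -0.999. Each extra serialize/deserialize cycle repeats the drift, which is consistent with the "-0.994" seen after several merge passes.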
Test case (slightly modified from TopNCounterSerializerTest.testSerialization):

    @Test
    public void testNegativeSerialization() {
        TopNCounter<ByteArray> vs = new TopNCounter<ByteArray>(50);
        Integer[] stream = { 1 };
        double[] incValues = { -1 };
        for (int counter = 0; counter < stream.length; ++counter) {
            Integer i = stream[counter];
            double v = incValues[counter];
            vs.offer(new ByteArray(Bytes.toBytes(i)), v);
        }
        vs.sortAndRetain();
        ByteBuffer out = ByteBuffer.allocate(1024);
        serializer.serialize(vs, out);
        byte[] copyBytes = new byte[out.position()];
        System.arraycopy(out.array(), 0, copyBytes, 0, out.position());
        ByteBuffer in = ByteBuffer.wrap(copyBytes);
        TopNCounter<ByteArray> vsNew = serializer.deserialize(in);
        Assert.assertEquals(vs.toString(), vsNew.toString());
    }

And the result is:

    org.junit.ComparisonFailure:
    Expected :[\x00\x00\x00\x01:-1.0]
    Actual   :[\x00\x00\x00\x01:-0.999]

So if we set the type of measure1 to double and build the cube from a single row { measure1 = -1.0 }, the query "select sum(measure1) from table group by dim2_id" returns "-0.994" instead of "-1".

________________________________
From: Tingmao Lin <[email protected]>
Sent: Monday, May 15, 2017 8:44 AM
To: [email protected]
Subject: Re: Questions about SUM behavior when rewritten as TOPN

The query log and the cube definition are below.

Query without rewrite:

2017-05-15 14:59:50,741 INFO [pool-8-thread-1] threadpool.DefaultScheduler:118 : Job Fetcher: 0 should running, 0 actual running, 0 ready, 577 already succeed, 19 error, 11 discarded, 0 others
2017-05-15 14:59:51,637 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:336 : Using project: TEST
2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:337 : The original query: select sum(v),count(v) from test group by lv2_id
2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:440 : The corrected query: select sum(v),count(v) from test group by lv2_id LIMIT 50000
2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:48 : The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 73.
2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first one
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[DEFAULT.TEST.LV2_ID], otherDimsD=[]
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[DEFAULT.TEST.LV2_ID]
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2_ID]
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:238 : Does not need storage aggregation
2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] storage.StorageContext:134 : Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max 10000
2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29542_29543
2017-05-15 14:59:51,763 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 311 bytes, rawScanBytesString 40 bytes
2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:141 : The scan 6fdd1fd2 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part of start/end key is set to 0
2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid exact match, from 1 to 1
Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00)
Stop:  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
2017-05-15 14:59:51,768 INFO [kylin-coproc--pool14-t14] v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20 GTScanRequest 6fdd1fd2>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03. Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP: 1(ms). Server CPU usage: 0.04998115611708333, server physical mem left: 8.621008896E10, server swap mem left: 0.0. Etc message: start latency: 58@0,agg done@1,compress done@1,server stats done@1, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;. Normal Complete: true.
2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:576 : Scan count for each storageContext: 1,
2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:359 : Stats of SQL response: isException: false, duration: 133, total scan count 1
2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:284 :
==========================[QUERY]===============================
Query Id: cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20
SQL: select sum(v),count(v) from test group by lv2_id LIMIT 50000
User: ADMIN
Success: true
Duration: 0.133
Project: TEST
Realization Names: [TEST1ROWS1DIM]
Cuboid Ids: [1]
Total scan count: 1
Result row count: 1
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
==========================[QUERY]===============================

Query with rewrite:

2017-05-15 14:59:34,156 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:336 : Using project: TEST
2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:337 : The original query: select sum(v) from test group by lv2_id
2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:440 : The corrected query: select sum(v) from test group by lv2_id LIMIT 50000
2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:48 : The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] cube.CubeCapabilityChecker:191 : Cube CUBE[name=TEST1ROWS1DIM] CapabilityInfluences: TOP10@class org.apache.kylin.measure.topn.TopNMeasureType
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 21.
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first one
2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] topn.TopNMeasureType:338 : Rewrite function FunctionDesc [expression=SUM, parameter=ParameterDesc [type=column, value=V, nextParam=null], returnType=bigint] to FunctionDesc [expression=TOP_N, parameter=ParameterDesc [type=column, value=V, nextParam=ParameterDesc [type=column, value=LV2_ID, nextParam=null]], returnType=topn(10)]
2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[], otherDimsD=[]
2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[]
2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2_ID]
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:241 : Need storage aggregation
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] storage.StorageContext:134 : Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max 10000
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:334 : Memory budget is set to 805306 rows
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29542_29543
2017-05-15 14:59:34,219 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] dict.DictionaryManager:420 : DictionaryManager(1833236614) loading DictionaryInfo(loadDictObj:true) at /dict/DEFAULT.TEST/LV2_ID/7be3d6b1-0ed4-4b2d-9015-0c92b0757725.dict
2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 306 bytes, rawScanBytesString 40 bytes
2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:141 : The scan 37ca24c9 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part of start/end key is set to 0
2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid require post aggregation, from 0 to 1
Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00)
Stop:  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
2017-05-15 14:59:34,256 INFO [kylin-coproc--pool14-t13] v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query 54ac350a-7676-4601-a109-04dec543e56e GTScanRequest 37ca24c9>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03. Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP: 2(ms). Server CPU usage: 0.07331964572432702, server physical mem left: 8.619999232E10, server swap mem left: 0.0. Etc message: start latency: 33@1,agg done@1,compress done@1,server stats done@2, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;. Normal Complete: true.
2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:576 : Scan count for each storageContext: 1,
2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:359 : Stats of SQL response: isException: false, duration: 102, total scan count 1
2017-05-15 14:59:34,260 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:284 :
==========================[QUERY]===============================
Query Id: 54ac350a-7676-4601-a109-04dec543e56e
SQL: select sum(v) from test group by lv2_id LIMIT 50000
User: ADMIN
Success: true
Duration: 0.102
Project: TEST
Realization Names: [TEST1ROWS1DIM]
Cuboid Ids: [1]
Total scan count: 1
Result row count: 1
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
==========================[QUERY]===============================

Cube definition:

{
  "uuid": "4a7fae8c-b518-4828-8f75-e6cd04689d75",
  "last_modified": 1494573765171,
  "version": "1.6.0",
  "name": "TEST1ROWS1DIM",
  "model_name": "TEST",
  "description": "",
  "null_string": null,
  "dimensions": [
    { "name": "LV2_ID", "table": "DEFAULT.TEST", "column": "LV2_ID", "derived": null },
    { "name": "MINUTE_START", "table": "DEFAULT.TEST", "column": "MINUTE_START", "derived": null }
  ],
  "measures": [
    {
      "name": "_COUNT_",
      "function": {
        "expression": "COUNT",
        "parameter": { "type": "constant", "value": "1", "next_parameter": null },
        "returntype": "bigint"
      },
      "dependent_measure_ref": null
    },
    {
      "name": "COUNT",
      "function": {
        "expression": "SUM",
        "parameter": { "type": "column", "value": "V", "next_parameter": null },
        "returntype": "bigint"
      },
      "dependent_measure_ref": null
    },
    {
      "name": "TOP10",
      "function": {
        "expression": "TOP_N",
        "parameter": {
          "type": "column",
          "value": "V",
          "next_parameter": { "type": "column", "value": "LV2_ID", "next_parameter": null }
        },
        "returntype": "topn(10)",
        "configuration": { "topn.encoding.LV2_ID": "dict" }
      },
      "dependent_measure_ref": null
    }
  ],
  "dictionaries": [],
  "rowkey": {
    "rowkey_columns": [
      { "column": "MINUTE_START", "encoding": "time", "isShardBy": false },
      { "column": "LV2_ID", "encoding": "dict", "isShardBy": false }
    ]
  },
  "hbase_mapping": {
    "column_family": [
      { "name": "F1", "columns": [ { "qualifier": "M", "measure_refs": [ "_COUNT_", "COUNT", "TOP10" ] } ] }
    ]
  },
  "aggregation_groups": [
    {
      "includes": [ "MINUTE_START", "LV2_ID" ],
      "select_rule": { "hierarchy_dims": [], "mandatory_dims": [], "joint_dims": [] }
    }
  ],
  "signature": "o75jqwjKPFF859EyU6j9eA==",
  "notify_list": [],
  "status_need_notify": [ "ERROR", "DISCARDED", "SUCCEED" ],
  "partition_date_start": 0,
  "partition_date_end": 3153600000000,
  "auto_merge_time_ranges": [ 604800000, 2419200000 ],
  "retention_range": 0,
  "engine_type": 2,
  "storage_type": 2,
  "override_kylin_properties": {}
}

________________________________
From: ShaoFeng Shi <[email protected]>
Sent: Monday, May 15, 2017 1:37 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Hi Tingmao,

Your finding is correct; Kylin didn't check whether there is
an "order by limit" pattern or not. The main reason is that the SQL parser didn't push down the "order by limit" info here (at least when TopN was developed), so I couldn't leverage it. Another reason is that usually we won't put the "group and order" column in as a normal cube dimension (though Kylin doesn't stop you from doing that). Please feel free to open a JIRA on checking the "order by limit" pattern; I can look at it later.

2017-05-15 2:37 GMT+08:00 Tingmao Lin <[email protected]>:

Yeah, we looked into the TopNCounter source code and found that in the low-cardinality scenario, m1=m2=0 in TopNCounter, so it just sums up the values. But the result still goes wrong. I will collect more information for investigation ASAP :)

I have a question: as Billy Liu said in this thread, Kylin will check the ORDER BY clause to determine whether to rewrite. But I didn't find any access to SQLDigest.sortColumns in TopNMeasureType.influenceCapabilityCheck(). Does Kylin check it elsewhere? It seems that if sum(measure) is the only measure in the query and the group by column matches, then TopNMeasureType.isTopNCompatibleSum() will pass and the query gets rewritten. This confuses the user, since they may expect an accurate result for every distinct value of the group by column(s).

________________________________
From: ShaoFeng Shi <[email protected]>
Sent: Sunday, May 14, 2017 2:13 PM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Agree with Yang's points. When the cardinality is smaller than the TopN counter's capacity, the result should be accurate. I checked the 1.6 source code and didn't find a clue. Please share more information (cube definition and query logs) for investigation, thanks!

For negative numbers in TopN, actually that isn't recommended, as it goes against TopN's purpose, which is counting occurrences of something.
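The hazard with negative values shows up in the counter merge step. The following is a minimal, illustrative Space-Saving-style merge, not Kylin's actual TopNCounter code: when one counter is full, keys it has not seen are credited with its smallest retained count, so a negative tail drags every merged estimate down:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a Space-Saving-style counter merge (illustrative
// only -- not Kylin's TopNCounter implementation).
public class TopNMergeSketch {
    // Merge 'other' into a copy of 'self'. If 'other' is at capacity,
    // keys it has not seen are credited with its smallest count, since
    // their true count in 'other' could be anywhere up to that bound.
    static Map<String, Double> merge(Map<String, Double> self,
                                     Map<String, Double> other,
                                     int capacity) {
        double floor = other.size() >= capacity
                ? Collections.min(other.values())
                : 0.0;
        Map<String, Double> merged = new HashMap<>(self);
        for (Map.Entry<String, Double> e : merged.entrySet()) {
            e.setValue(e.getValue() + other.getOrDefault(e.getKey(), floor));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> self = new HashMap<>();
        self.put("A", 5.0);
        Map<String, Double> other = new HashMap<>();
        other.put("B", -100.0); // full counter whose smallest count is negative
        // With a non-negative floor the estimate for "A" stays near 5.0;
        // the negative tail instead drags it down to -95.0.
        System.out.println(merge(self, other, 1));
    }
}
```

With non-negative counts this floor only ever over-estimates by a bounded amount; with a negative floor it actively subtracts from existing elements, which is the distortion described in this thread.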
When merging two TopN counters, one counter will use the other's last element's count to accumulate onto its own elements (as a supplement, if the other is full). If the last element is close to 0, its impact will be minor. But if the last element is a big negative value, its impact will be considerable! It doesn't make sense to reduce an existing element's count just because the last element's value is negative. So please use it properly in your scenario. Of course, I think Kylin should also add more checks there.

2017-05-14 17:18 GMT+08:00 Tingmao Lin <[email protected]>:

The SQL in the original email is exactly what we input in the "Insight" tab in the Kylin admin UI. I do not have access to the host running Kylin now, and I will post the detailed log output tomorrow. We reproduced the inaccurate-result behavior using a source table with <10 rows, and I should be able to write up the reproduction steps tomorrow.

________________________________
From: Billy Liu <[email protected]>
Sent: Sunday, May 14, 2017 12:21 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Thanks Tingmao for the report. Could you show us the complete SQL? In your SQL there is no ORDER BY statement, and without an ORDER BY the query should not be rewritten into the TopN measure.

2017-05-12 23:52 GMT+08:00 Tingmao Lin <[email protected]>:

Hi,

We found that a SUM() query on a cardinality-1 dimension is not accurate (or "not correct") when automatically rewritten as TOPN. Is that the expected behavior of Kylin, or is there some other issue?

We built a cube on a table (measure1: bigint, dim1_id: varchar, dim2_id: varchar, ...) using Kylin 1.6.0 (Kafka streaming source). The cube has two measures, SUM(measure1) and TOPN(10, sum-orderby(measure1), group by dim2_id) (other measures omitted), and two dimensions, dim1_id and dim2_id (other dims omitted).

About the source table data:
The cardinality of dim1_id is 1 (same dim1_id for all rows in the source table).
The cardinality of dim2_id is 1 (same dim2_id for all rows in the source table).
The possible values of measure1 are [1, 0, -1].

When we query "select SUM(measure1) FROM table GROUP BY dim2_id" => the result has one row: "sum=7". From the Kylin logs we found that the query has been automatically rewritten as TOPN(measure1, sum-orderby(measure1), group by dim2_id).

When we write another query to prevent the TOPN rewrite, for example:
"select SUM(measure1),count(*) FROM table GROUP BY dim2_id" => one row -- "sum=-2, count=24576"
"select SUM(measure1),count(*) FROM table" => one row -- "sum=-2, count=24576"

The result differs (7 vs. -2) depending on whether the query is rewritten to TOPN. My question is: is the following behavior working as expected, does the TOPN algorithm simply not support negative counter values well, or is there some other issue?

1. A SUM() query is automatically rewritten as TOPN and gives an approximate result even when no TOPN appears in the query.
2. When the cardinality is 1, TOPN does not give an accurate result.

Thanks.

--
Best regards,

Shaofeng Shi 史少锋
