OK, that explains it then.

The Top N algorithm is based on the assumption that your data follows a Zipf
distribution (https://en.wikipedia.org/wiki/Zipf%27s_law). If it doesn't, the
result can be very inaccurate. Your sample is such a case: with only one key
and negative counters, the error accumulates and eventually drifts away
from the real value, which I don't think is a bug. Please use a SUM measure
instead of abusing TopN.
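For reference, the TopN measure is built on a Space-Saving-style counter. The sketch below is a hypothetical simplification (the class name `SpaceSavingSketch` and all details are mine, not the actual Kylin `TopNCounter`), showing why results are exact while the capacity covers every distinct key, but overestimate on flat, non-Zipf data once eviction starts:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the Space-Saving idea behind TopN-style counters
// (a hypothetical simplification, not the actual Kylin TopNCounter):
// with at most `capacity` slots, an unseen key evicts the current
// minimum and inherits its count, which overestimates on flat data.
public class SpaceSavingSketch {
    private final int capacity;
    private final Map<String, Double> counters = new HashMap<>();

    public SpaceSavingSketch(int capacity) { this.capacity = capacity; }

    public void offer(String key, double inc) {
        if (counters.containsKey(key)) {
            counters.merge(key, inc, Double::sum);
        } else if (counters.size() < capacity) {
            counters.put(key, inc);  // room left: still exact
        } else {
            // Evict the minimum and inherit its count: this is where
            // the approximation (and overestimation) comes from.
            String minKey = null;
            double min = Double.MAX_VALUE;
            for (Map.Entry<String, Double> e : counters.entrySet()) {
                if (e.getValue() < min) { min = e.getValue(); minKey = e.getKey(); }
            }
            counters.remove(minKey);
            counters.put(key, min + inc);
        }
    }

    public double get(String key) { return counters.getOrDefault(key, 0.0); }

    public static void main(String[] args) {
        // capacity >= cardinality: exact
        SpaceSavingSketch exact = new SpaceSavingSketch(2);
        exact.offer("a", 1); exact.offer("b", 1); exact.offer("a", 1);
        System.out.println(exact.get("a")); // 2.0

        // capacity < cardinality on flat (non-Zipf) data: overestimates
        SpaceSavingSketch approx = new SpaceSavingSketch(1);
        approx.offer("a", 1); approx.offer("b", 1); approx.offer("c", 1);
        System.out.println(approx.get("c")); // 3.0, but the true count is 1
    }
}
```

The second case is the failure mode: once the counter is full, newcomers inherit an evicted key's count, and that error only stays negligible when the distribution is heavily skewed.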

2017-05-15 16:55 GMT+08:00 Tingmao Lin <[email protected]>:

> We found something about serialization precision that might be related to
> the inaccurate-result problem.
>
>
> TopNCounterSerializer uses DoubleDeltaSerializer to serialize the double
> array. DoubleDeltaSerializer writes each double value as
> int(d*1000 + 0.5)/1000, so with each serialization pass a negative
> integer value increases by 0.001.
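The rounding is easy to reproduce in isolation. This standalone snippet (my reconstruction of the formula quoted above, not the actual DoubleDeltaSerializer source) shows why the `+0.5` round-half-up trick breaks for negative values: the `(int)` cast truncates toward zero.

```java
public class DeltaRounding {
    // The rounding described in the thread (a standalone reconstruction,
    // not the actual DoubleDeltaSerializer code):
    static double roundTrip(double d) {
        return ((int) (d * 1000 + 0.5)) / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(1.0));   // 1.0 -- positive values survive
        System.out.println(roundTrip(-1.0));  // -0.999 -- the (int) cast
                                              // truncates toward zero, so the
                                              // +0.5 "round half up" trick
                                              // fails for negatives
    }
}
```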
>
>
> Test case (slightly modified from TopNCounterSerializerTest.testSerialization):
>
>
> @Test
> public void testNegativeSerialization() {
>     // "serializer" below is the TopNCounterSerializer field from the original test class
>     TopNCounter<ByteArray> vs = new TopNCounter<ByteArray>(50);
>     Integer[] stream = {1};
>     double[] incValues = {-1};
>     for (int counter = 0; counter < stream.length; ++counter) {
>         Integer i = stream[counter]; double v = incValues[counter];
>         vs.offer(new ByteArray(Bytes.toBytes(i)), v);
>     }
>     vs.sortAndRetain();
>     ByteBuffer out = ByteBuffer.allocate(1024);
>     serializer.serialize(vs, out);
>
>     byte[] copyBytes = new byte[out.position()];
>     System.arraycopy(out.array(), 0, copyBytes, 0, out.position());
>
>     ByteBuffer in = ByteBuffer.wrap(copyBytes);
>     TopNCounter<ByteArray> vsNew = serializer.deserialize(in);
>
>     Assert.assertEquals(vs.toString(), vsNew.toString());
> }
>
>
> and the result is:
>
> org.junit.ComparisonFailure:
> Expected :[\x00\x00\x00\x01:-1.0]
> Actual   :[\x00\x00\x00\x01:-0.999]
>
>
> So if we set the type of measure1 to double and build the cube from a single
> row {measure1 = -1.0},
> "select sum(measure1) from table group by dim2_id" returns
> "-0.994" instead of "-1".
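Assuming the same rounding formula, the drift compounds by 0.001 per serialize/deserialize cycle, so six such cycles (e.g. across cuboid build and merge steps) would turn -1.0 into the observed -0.994. A standalone sketch (not Kylin code):

```java
public class ErrorAccumulation {
    // Same rounding as in the thread's DoubleDeltaSerializer description
    // (a standalone reconstruction, not Kylin code):
    static double roundTrip(double d) {
        return ((int) (d * 1000 + 0.5)) / 1000.0;
    }

    public static void main(String[] args) {
        // Each serialize/deserialize cycle shifts a negative counter
        // up by 0.001:
        double v = -1.0;
        for (int i = 0; i < 6; i++) {
            v = roundTrip(v);
            System.out.println(v);
        }
        // prints -0.999, -0.998, ..., -0.994 -- consistent with the
        // observed "-0.994 instead of -1" after several round trips
    }
}
```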
>
>
> ------------------------------
> *From:* Tingmao Lin <[email protected]>
> *Sent:* Monday, May 15, 2017 8:44 AM
> *To:* [email protected]
>
> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>
>
> The query log and the cube definition is here:
>
>
>
> Query without rewrite
> 2017-05-15 14:59:50,741 INFO  [pool-8-thread-1]
> threadpool.DefaultScheduler:118 : Job Fetcher: 0 should running, 0 actual
> running, 0 ready, 577 already succeed, 19 error, 11 discarded, 0 others
> 2017-05-15 14:59:51,637 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> service.QueryService:336 : Using project: TEST
> 2017-05-15 14:59:51,638 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> service.QueryService:337 : The original query:  select sum(v),count(v) from
> test group by lv2_id
> 2017-05-15 14:59:51,638 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> service.QueryService:440 : The corrected query: select sum(v),count(v) from
> test group by lv2_id
> LIMIT 50000
> 2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> routing.QueryRouter:48 : The project manager's reference is
> org.apache.kylin.metadata.project.ProjectManager@ac216f8
> 2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and
> project=TEST : CUBE[name=TEST1ROWS1DIM]
> 2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.
> routing.rules.RemoveBlackoutRealizationsRule, realizations before:
> [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
> 2017-05-15 14:59:51,674 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.
> routing.rules.RemoveUncapableRealizationsRule, realizations before:
> [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
> 2017-05-15 14:59:51,675 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 73.
> 2017-05-15 14:59:51,675 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.
> routing.rules.RealizationSortRule, realizations before:
> [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
> 2017-05-15 14:59:51,675 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)]
> And the final chosen one is the first one
> 2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified:
> cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[DEFAULT.TEST.LV2_ID], otherDimsD=[]
> 2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> gtrecord.GTCubeStorageQueryBase:230 : GroupD :[DEFAULT.TEST.LV2_ID]
> 2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
> 2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns
> :[DEFAULT.TEST.LV2_ID]
> 2017-05-15 14:59:51,707 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> gtrecord.GTCubeStorageQueryBase:238 : Does not need storage aggregation
> 2017-05-15 14:59:51,708 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> storage.StorageContext:134 : Not enabling limit push down because the
> limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max
> 10000
> 2017-05-15 14:59:51,708 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment
> 29542_29543
> 2017-05-15 14:59:51,763 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 311 bytes,
> rawScanBytesString 40 bytes
> 2017-05-15 14:59:51,764 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> v2.CubeHBaseEndpointRPC:141 : The scan 6fdd1fd2 for segment
> TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard
> part of start/end key is set to 0
> 2017-05-15 14:59:51,764 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid exact
> match, from 1 to 1 Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00
> (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop:
>  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00
> (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
> 2017-05-15 14:59:51,768 INFO  [kylin-coproc--pool14-t14]
> v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query 
> cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20
> GTScanRequest 6fdd1fd2>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ
> Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\
> x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\
> x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\
> x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\
> x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03.Total scanned row:
> 1. Total filtered/aggred row: 0. Time elapsed in EP: 1(ms). Server CPU
> usage: 0.04998115611708333, server physical mem left: 8.621008896E10,
> server swap mem left:0.0.Etc message: start latency: 58@0,agg done@1,compress
> done@1,server stats done@1, 
> debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.Normal
> Complete: true.
> 2017-05-15 14:59:51,771 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> service.QueryService:576 : Scan count for each storageContext: 1,
> 2017-05-15 14:59:51,771 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> service.QueryService:359 : Stats of SQL response: isException: false,
> duration: 133, total scan count 1
> 2017-05-15 14:59:51,771 INFO  [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104]
> service.QueryService:284 :
> ==========================[QUERY]===============================
> Query Id: cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20
> SQL: select sum(v),count(v) from test group by lv2_id
> LIMIT 50000
> User: ADMIN
> Success: true
> Duration: 0.133
> Project: TEST
> Realization Names: [TEST1ROWS1DIM]
> Cuboid Ids: [1]
> Total scan count: 1
> Result row count: 1
> Accept Partial: true
> Is Partial Result: false
> Hit Exception Cache: false
> Storage cache used: false
> Message: null
> ==========================[QUERY]===============================
>
>
>
>
>
> Query with rewrite:
> 2017-05-15 14:59:34,156 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> service.QueryService:336 : Using project: TEST
> 2017-05-15 14:59:34,157 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> service.QueryService:337 : The original query:  select sum(v) from test
> group by lv2_id
> 2017-05-15 14:59:34,157 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> service.QueryService:440 : The corrected query: select sum(v) from test
> group by lv2_id
> LIMIT 50000
> 2017-05-15 14:59:34,185 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> routing.QueryRouter:48 : The project manager's reference is
> org.apache.kylin.metadata.project.ProjectManager@ac216f8
> 2017-05-15 14:59:34,185 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and
> project=TEST : CUBE[name=TEST1ROWS1DIM]
> 2017-05-15 14:59:34,185 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.
> routing.rules.RemoveBlackoutRealizationsRule, realizations before:
> [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
> 2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> cube.CubeCapabilityChecker:191 : Cube CUBE[name=TEST1ROWS1DIM]
> CapabilityInfluences: TOP10@class org.apache.kylin.measure.topn.
> TopNMeasureType
> 2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.
> routing.rules.RemoveUncapableRealizationsRule, realizations before:
> [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
> 2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 21.
> 2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.
> routing.rules.RealizationSortRule, realizations before:
> [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
> 2017-05-15 14:59:34,186 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)]
> And the final chosen one is the first one
> 2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> topn.TopNMeasureType:338 : Rewrite function FunctionDesc [expression=SUM,
> parameter=ParameterDesc [type=column, value=V, nextParam=null],
> returnType=bigint] to FunctionDesc [expression=TOP_N,
> parameter=ParameterDesc [type=column, value=V, nextParam=ParameterDesc
> [type=column, value=LV2_ID, nextParam=null]], returnType=topn(10)]
> 2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified:
> cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[], otherDimsD=[]
> 2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> gtrecord.GTCubeStorageQueryBase:230 : GroupD :[]
> 2017-05-15 14:59:34,217 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
> 2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns
> :[DEFAULT.TEST.LV2_ID]
> 2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> gtrecord.GTCubeStorageQueryBase:241 : Need storage aggregation
> 2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> storage.StorageContext:134 : Not enabling limit push down because the
> limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max
> 10000
> 2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> gtrecord.GTCubeStorageQueryBase:334 : Memory budget is set to 805306 rows
> 2017-05-15 14:59:34,218 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment
> 29542_29543
> 2017-05-15 14:59:34,219 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> dict.DictionaryManager:420 : DictionaryManager(1833236614) loading
> DictionaryInfo(loadDictObj:true) at /dict/DEFAULT.TEST/LV2_ID/
> 7be3d6b1-0ed4-4b2d-9015-0c92b0757725.dict
> 2017-05-15 14:59:34,221 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 306 bytes,
> rawScanBytesString 40 bytes
> 2017-05-15 14:59:34,221 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> v2.CubeHBaseEndpointRPC:141 : The scan 37ca24c9 for segment
> TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard
> part of start/end key is set to 0
> 2017-05-15 14:59:34,221 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid require
> post aggregation, from 0 to 1 Start: 
> \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00
> (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00) Stop:
>  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00
> (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
> 2017-05-15 14:59:34,256 INFO  [kylin-coproc--pool14-t13]
> v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query 
> 54ac350a-7676-4601-a109-04dec543e56e
> GTScanRequest 37ca24c9>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ
> Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\
> x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\
> x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\
> x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\
> x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03.Total scanned row:
> 1. Total filtered/aggred row: 0. Time elapsed in EP: 2(ms). Server CPU
> usage: 0.07331964572432702, server physical mem left: 8.619999232E10,
> server swap mem left:0.0.Etc message: start latency: 33@1,agg done@1,compress
> done@1,server stats done@2, 
> debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;.Normal
> Complete: true.
> 2017-05-15 14:59:34,259 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> service.QueryService:576 : Scan count for each storageContext: 1,
> 2017-05-15 14:59:34,259 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> service.QueryService:359 : Stats of SQL response: isException: false,
> duration: 102, total scan count 1
> 2017-05-15 14:59:34,260 INFO  [Query 54ac350a-7676-4601-a109-04dec543e56e-108]
> service.QueryService:284 :
> ==========================[QUERY]===============================
> Query Id: 54ac350a-7676-4601-a109-04dec543e56e
> SQL: select sum(v) from test group by lv2_id
> LIMIT 50000
> User: ADMIN
> Success: true
> Duration: 0.102
> Project: TEST
> Realization Names: [TEST1ROWS1DIM]
> Cuboid Ids: [1]
> Total scan count: 1
> Result row count: 1
> Accept Partial: true
> Is Partial Result: false
> Hit Exception Cache: false
> Storage cache used: false
> Message: null
> ==========================[QUERY]===============================
>
>
>
>
>
> cube definition:
>
> {
>   "uuid": "4a7fae8c-b518-4828-8f75-e6cd04689d75",
>   "last_modified": 1494573765171,
>   "version": "1.6.0",
>   "name": "TEST1ROWS1DIM",
>   "model_name": "TEST",
>   "description": "",
>   "null_string": null,
>   "dimensions": [
>     {
>       "name": "LV2_ID",
>       "table": "DEFAULT.TEST",
>       "column": "LV2_ID",
>       "derived": null
>     },
>     {
>       "name": "MINUTE_START",
>       "table": "DEFAULT.TEST",
>       "column": "MINUTE_START",
>       "derived": null
>     }
>   ],
>   "measures": [
>     {
>       "name": "_COUNT_",
>       "function": {
>         "expression": "COUNT",
>         "parameter": {
>           "type": "constant",
>           "value": "1",
>           "next_parameter": null
>         },
>         "returntype": "bigint"
>       },
>       "dependent_measure_ref": null
>     },
>     {
>       "name": "COUNT",
>       "function": {
>         "expression": "SUM",
>         "parameter": {
>           "type": "column",
>           "value": "V",
>           "next_parameter": null
>         },
>         "returntype": "bigint"
>       },
>       "dependent_measure_ref": null
>     },
>     {
>       "name": "TOP10",
>       "function": {
>         "expression": "TOP_N",
>         "parameter": {
>           "type": "column",
>           "value": "V",
>           "next_parameter": {
>             "type": "column",
>             "value": "LV2_ID",
>             "next_parameter": null
>           }
>         },
>         "returntype": "topn(10)",
>         "configuration": {
>           "topn.encoding.LV2_ID": "dict"
>         }
>       },
>       "dependent_measure_ref": null
>     }
>   ],
>   "dictionaries": [],
>   "rowkey": {
>     "rowkey_columns": [
>       {
>         "column": "MINUTE_START",
>         "encoding": "time",
>         "isShardBy": false
>       },
>       {
>         "column": "LV2_ID",
>         "encoding": "dict",
>         "isShardBy": false
>       }
>     ]
>   },
>   "hbase_mapping": {
>     "column_family": [
>       {
>         "name": "F1",
>         "columns": [
>           {
>             "qualifier": "M",
>             "measure_refs": [
>               "_COUNT_",
>               "COUNT",
>               "TOP10"
>             ]
>           }
>         ]
>       }
>     ]
>   },
>   "aggregation_groups": [
>     {
>       "includes": [
>         "MINUTE_START",
>         "LV2_ID"
>       ],
>       "select_rule": {
>         "hierarchy_dims": [],
>         "mandatory_dims": [],
>         "joint_dims": []
>       }
>     }
>   ],
>   "signature": "o75jqwjKPFF859EyU6j9eA==",
>   "notify_list": [],
>   "status_need_notify": [
>     "ERROR",
>     "DISCARDED",
>     "SUCCEED"
>   ],
>   "partition_date_start": 0,
>   "partition_date_end": 3153600000000,
>   "auto_merge_time_ranges": [
>     604800000,
>     2419200000
>   ],
>   "retention_range": 0,
>   "engine_type": 2,
>   "storage_type": 2,
>   "override_kylin_properties": {}
> }
>
>
>
> ------------------------------
> *From:* ShaoFeng Shi <[email protected]>
> *Sent:* Monday, May 15, 2017 1:37 AM
> *To:* user
> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>
> Hi Tingmao,
>
> Your finding is correct; Kylin didn't check whether there is "order by
> limit" pattern or not. The main reason is, the SQL parser didn't push down
> the "order by limit" info here (at least when TopN was developed), so I
> couldn't leverage that.  Another reason is usually we won't put the "group
> and order" column as a normal cube dimension (but Kylin didn't stop you
> from doing that).
>
> Please feel free to open a JIRA on the checking "order by limit" pattern,
> I can check that later.
>
>
>
> 2017-05-15 2:37 GMT+08:00 Tingmao Lin <[email protected]>:
>
>> Yeah, we looked into the TopNCounter source code and found that in the
>> low-cardinality scenario, m1 = m2 = 0 in TopNCounter, so it just sums up
>> the values. But the result still goes wrong. I will collect more
>> information for the investigation ASAP :)
>>
>>
>> I have a question: as Billy Liu said in this thread, Kylin will check the
>> ORDER BY clause to determine whether to rewrite, but I didn't find any
>> access to SQLDigest.sortColumns in
>> TopNMeasureType.influenceCapabilityCheck().
>> Does Kylin check it elsewhere? It seems that if sum(measure) is
>> the only measure in the query and the group-by column matches, then
>> TopNMeasureType.isTopNCompatibleSum() will pass and the query gets
>> rewritten. This confuses users, since they may expect an accurate result
>> for every distinct value of the group-by column(s).
>>
>> ------------------------------
>> *From:* ShaoFeng Shi <[email protected]>
>> *Sent:* Sunday, May 14, 2017 2:13 PM
>>
>> *To:* user
>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>
>> Agree with Yang's points. When the cardinality is smaller than the TopN
>> counter's capacity, the result should be accurate. I checked the 1.6 source
>> code and didn't find a clue. Please share more information (cube definition
>> and query logs) for the investigation, thanks!
>>
>> Regarding negative numbers in TopN: they actually aren't recommended, as
>> they go against TopN's purpose, which is to count occurrences of something.
>> When merging two TopN counters, one counter will use the other's last
>> element's count to accumulate (if the other is full) on its own elements
>> (as a supplement). If the last element is close to 0, its impact will be
>> minor. But if the last element is a big negative value, its impact will be
>> considerable! It doesn't make sense to reduce an existing element's count
>> when the last element's value is negative. So please use it properly in
>> your scenario. Of course, I think Kylin should also add more checks there.
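The merge behavior described here can be sketched roughly as follows (a hypothetical simplification of what `TopNCounter.merge` does, not the actual Kylin code; all names are mine). It shows how a large negative last element corrupts counts for keys missing on one side:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of the merge "supplement" described above (a hypothetical
// simplification, not Kylin's actual TopNCounter.merge): when one side is
// full, a key it is missing may have been evicted with at most that side's
// minimum count, so the minimum is added as a supplement.
public class TopNMergeSketch {
    static Map<String, Double> merge(Map<String, Double> a, boolean aFull,
                                     Map<String, Double> b, boolean bFull) {
        double minA = a.values().stream().min(Double::compare).orElse(0.0);
        double minB = b.values().stream().min(Double::compare).orElse(0.0);
        Map<String, Double> merged = new HashMap<>();
        for (String k : a.keySet()) {
            // key missing on b's side: supplement with b's minimum if b was full
            double fromB = b.containsKey(k) ? b.get(k) : (bFull ? minB : 0.0);
            merged.put(k, a.get(k) + fromB);
        }
        for (String k : b.keySet()) {
            if (!merged.containsKey(k)) {
                merged.put(k, b.get(k) + (aFull ? minA : 0.0));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("x", 100.0); a.put("y", -50.0);  // last element is a big negative
        Map<String, Double> b = new HashMap<>();
        b.put("x", 10.0); b.put("z", 5.0);
        // Both counters full: "z" gets supplemented with a's minimum, -50,
        // so it ends up far below its true count of 5.
        System.out.println(merge(a, true, b, true).get("z")); // -45.0
    }
}
```

With a non-negative minimum the supplement only overestimates a little; a big negative minimum actively subtracts from correct counts, which is the distortion described above.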
>>
>> 2017-05-14 17:18 GMT+08:00 Tingmao Lin <[email protected]>:
>>
>>> The SQL in the original email is exactly what we typed into the "Insight"
>>> tab of the Kylin admin UI.
>>>
>>> I don't have access to the host running Kylin right now; I will post the
>>> detailed log output tomorrow.
>>>
>>>
>>> We reproduced the inaccurate-result behavior using a source table with
>>> fewer than 10 rows, and I should be able to write up reproduction steps
>>> tomorrow.
>>>
>>> ------------------------------
>>> *From:* Billy Liu <[email protected]>
>>> *Sent:* Sunday, May 14, 2017 12:21 AM
>>> *To:* user
>>> *Subject:* Re: Questions about SUM behavior when rewritten as TOPN
>>>
>>> Thanks Tingmao for the report.
>>>
>>> Could you show us the complete SQL? In your SQL, there is no ORDER BY
>>> statement. Without an ORDER BY, the query should not be rewritten into the
>>> TopN measure.
>>>
>>> 2017-05-12 23:52 GMT+08:00 Tingmao Lin <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> We found that a SUM() query on a cardinality-1 dimension is not accurate
>>>> (or "not correct") when automatically rewritten as TOPN.
>>>> Is that the expected behavior of Kylin, or is there some other issue?
>>>>
>>>> We built a cube on a table ( measure1: bigint, dim1_id:varchar,
>>>> dim2_id:varchar, ... ) using kylin 1.6.0 (Kafka streaming source)
>>>>
>>>> The cube has two measures: SUM(measure1) and
>>>> TOPN(10,sum-orderby(measure1),group by dim2_id) . (other measures
>>>> omitted)
>>>> and two dimensions  dim1_id, dim2_id   (other dims omitted)
>>>>
>>>> About the source table data:
>>>> The cardinality of dim1_id  is 1 (same dim1_id for all rows in the
>>>> source table)
>>>> The cardinality of dim2_id  is 1 (same dim2_id for all rows in the
>>>> source table)
>>>> The possible value of measure1 is [1,0,-1]
>>>>
>>>> When we query
>>>>     "select SUM(measure1) FROM table GROUP BY dim2_id"
>>>>    =>  the result has one row: "sum=7".
>>>> From the Kylin logs we found that the query had been
>>>> automatically rewritten as TOPN(measure1, sum-orderby(measure1), group
>>>> by dim2_id).
>>>>
>>>> When we write another query to prevent TOPN rewrite, for example:
>>>>
>>>>    "select SUM(measure1),count(*) FROM table GROUP BY dim2_id"
>>>>        =>  one row -- "sum=-2,count=24576"
>>>>
>>>>    "select SUM(measure1),count(*) FROM table"
>>>>        =>  one row -- "sum=-2,count=24576"
>>>>
>>>>
>>>> The result differs (7 vs. -2) depending on whether the query is rewritten to TOPN.
>>>>
>>>>
>>>> My question is: is the following behavior "working as expected", does the
>>>> TOPN algorithm not support negative counter values well, or is there some
>>>> other issue?
>>>>
>>>>
>>>> 1. A SUM() query is automatically rewritten as TOPN and gives an
>>>> approximate result even when no TOPN is present in the query.
>>>>
>>>> 2. When the cardinality is 1, TOPN does not give an accurate result.
>>>>
>>>>
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>
>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>


-- 
Best regards,

Shaofeng Shi 史少锋
