Yes, the serializer just doesn't work with negative counters. That explains why the result is inaccurate, and it's not a bug.
About the SUM auto-rewrite behavior, I have created a JIRA: https://issues.apache.org/jira/browse/KYLIN-2620

Thanks to Shi ShaoFeng, Li Yang and Billy Liu for helping resolve the questions.

________________________________
From: ShaoFeng Shi <[email protected]>
Sent: Monday, May 15, 2017 10:36 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

OK, then that explains it. The Top N algorithm is based on the assumption that your data follows a Zipf distribution (https://en.wikipedia.org/wiki/Zipf%27s_law). If it does not, the result can be very inaccurate. Your sample is such a case: with only one key and negative counters, the error accumulates and finally shows a variation from the real value, which I don't think is a bug. Please use a SUM measure instead of abusing TopN.

2017-05-15 16:55 GMT+08:00 Tingmao Lin <[email protected]>:

We found something about serialization precision that might be related to the inaccurate-result problem. TopNCounterSerializer uses DoubleDeltaSerializer to serialize the double array. DoubleDeltaSerializer writes each double value as int(d*1000+0.5)/1000, so on each serialization pass a negative integer value increases by 0.001.
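The rounding loss described above can be reproduced in isolation. The following is a standalone sketch that mimics the int(d*1000+0.5) fixed-point conversion; the class and method names are illustrative, and this is not the actual DoubleDeltaSerializer code:

```java
// Standalone sketch of the fixed-point conversion described above.
// (Illustrative only -- not the actual Kylin DoubleDeltaSerializer.)
public class DeltaRoundingDemo {
    // Scale to thousandths, round, and scale back: one serialize/deserialize pass.
    static double roundTrip(double d) {
        long scaled = (long) (d * 1000 + 0.5); // (long) cast truncates toward zero
        return scaled / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(1.0));  // positive integers survive: 1.0
        System.out.println(roundTrip(-1.0)); // negatives drift upward: -0.999
    }
}
```

The +0.5 offset rounds positive values correctly, but for -1.0 it yields -999.5, which the cast truncates toward zero to -999, i.e. -0.999. Each extra serialize/deserialize cycle repeats the drift, which is consistent with the "-0.994" seen after several merge passes.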
Test case (slightly modified from TopNCounterSerializerTest.testSerialization):

    @Test
    public void testNegativeSerialization() {
        TopNCounter<ByteArray> vs = new TopNCounter<ByteArray>(50);
        Integer[] stream = { 1 };
        double[] incValues = { -1 };
        for (int counter = 0; counter < stream.length; ++counter) {
            Integer i = stream[counter];
            double v = incValues[counter];
            vs.offer(new ByteArray(Bytes.toBytes(i)), v);
        }
        vs.sortAndRetain();
        ByteBuffer out = ByteBuffer.allocate(1024);
        serializer.serialize(vs, out);
        byte[] copyBytes = new byte[out.position()];
        System.arraycopy(out.array(), 0, copyBytes, 0, out.position());
        ByteBuffer in = ByteBuffer.wrap(copyBytes);
        TopNCounter<ByteArray> vsNew = serializer.deserialize(in);
        Assert.assertEquals(vs.toString(), vsNew.toString());
    }

And the result is:

    org.junit.ComparisonFailure:
    Expected :[\x00\x00\x00\x01:-1.0]
    Actual   :[\x00\x00\x00\x01:-0.999]

So if we set the type of measure1 to double and build the cube from a single row { measure1 = -1.0 }, the query "select sum(measure1) from table group by dim2_id" returns "-0.994" instead of "-1".

________________________________
From: Tingmao Lin <[email protected]>
Sent: Monday, May 15, 2017 8:44 AM
To: [email protected]
Subject: Re: Questions about SUM behavior when rewritten as TOPN

The query log and the cube definition are below.

Query without rewrite:

2017-05-15 14:59:50,741 INFO [pool-8-thread-1] threadpool.DefaultScheduler:118 : Job Fetcher: 0 should running, 0 actual running, 0 ready, 577 already succeed, 19 error, 11 discarded, 0 others
2017-05-15 14:59:51,637 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:336 : Using project: TEST
2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:337 : The original query: select sum(v),count(v) from test group by lv2_id
2017-05-15 14:59:51,638 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:440 : The corrected query: select sum(v),count(v) from test group by lv2_id LIMIT 50000
2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:48 : The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,674 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 73.
2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:51,675 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first one
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[DEFAULT.TEST.LV2_ID], otherDimsD=[]
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[DEFAULT.TEST.LV2_ID]
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2_ID]
2017-05-15 14:59:51,707 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.GTCubeStorageQueryBase:238 : Does not need storage aggregation
2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] storage.StorageContext:134 : Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max 10000
2017-05-15 14:59:51,708 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29542_29543
2017-05-15 14:59:51,763 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 311 bytes, rawScanBytesString 40 bytes
2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseEndpointRPC:141 : The scan 6fdd1fd2 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part of start/end key is set to 0
2017-05-15 14:59:51,764 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid exact match, from 1 to 1
Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00)
Stop:  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
2017-05-15 14:59:51,768 INFO [kylin-coproc--pool14-t14] v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20 GTScanRequest 6fdd1fd2>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03. Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP: 1(ms). Server CPU usage: 0.04998115611708333, server physical mem left: 8.621008896E10, server swap mem left: 0.0. Etc message: start latency: 58@0,agg done@1,compress done@1,server stats done@1, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;. Normal Complete: true.
2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:576 : Scan count for each storageContext: 1,
2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:359 : Stats of SQL response: isException: false, duration: 133, total scan count 1
2017-05-15 14:59:51,771 INFO [Query cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20-104] service.QueryService:284 :
==========================[QUERY]===============================
Query Id: cf1fe7fa-97af-4a4c-ae49-d6cf38ebcd20
SQL: select sum(v),count(v) from test group by lv2_id LIMIT 50000
User: ADMIN
Success: true
Duration: 0.133
Project: TEST
Realization Names: [TEST1ROWS1DIM]
Cuboid Ids: [1]
Total scan count: 1
Result row count: 1
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
==========================[QUERY]===============================

Query with rewrite:

2017-05-15 14:59:34,156 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:336 : Using project: TEST
2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:337 : The original query: select sum(v) from test group by lv2_id
2017-05-15 14:59:34,157 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:440 : The corrected query: select sum(v) from test group by lv2_id LIMIT 50000
2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:48 : The project manager's reference is org.apache.kylin.metadata.project.ProjectManager@ac216f8
2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:60 : Find candidates by table DEFAULT.TEST and project=TEST : CUBE[name=TEST1ROWS1DIM]
2017-05-15 14:59:34,185 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveBlackoutRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] cube.CubeCapabilityChecker:191 : Cube CUBE[name=TEST1ROWS1DIM] CapabilityInfluences: TOP10@class org.apache.kylin.measure.topn.TopNMeasureType
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RemoveUncapableRealizationsRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] rules.RealizationSortRule:40 : CUBE[name=TEST1ROWS1DIM] priority 1 cost 21.
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:51 : Applying rule: class org.apache.kylin.query.routing.rules.RealizationSortRule, realizations before: [TEST1ROWS1DIM(CUBE)], realizations after: [TEST1ROWS1DIM(CUBE)]
2017-05-15 14:59:34,186 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] routing.QueryRouter:72 : The realizations remaining: [TEST1ROWS1DIM(CUBE)] And the final chosen one is the first one
2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] topn.TopNMeasureType:338 : Rewrite function FunctionDesc [expression=SUM, parameter=ParameterDesc [type=column, value=V, nextParam=null], returnType=bigint] to FunctionDesc [expression=TOP_N, parameter=ParameterDesc [type=column, value=V, nextParam=ParameterDesc [type=column, value=LV2_ID, nextParam=null]], returnType=topn(10)]
2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:105 : Cuboid identified: cube=TEST1ROWS1DIM, cuboidId=1, groupsD=[], otherDimsD=[]
2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:230 : GroupD :[]
2017-05-15 14:59:34,217 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:231 : SingleValueD :[]
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:232 : Cuboid columns :[DEFAULT.TEST.LV2_ID]
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:241 : Need storage aggregation
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] storage.StorageContext:134 : Not enabling limit push down because the limit(including offset) 50000 is larger than kylin.query.pushdown.limit.max 10000
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.GTCubeStorageQueryBase:334 : Memory budget is set to 805306 rows
2017-05-15 14:59:34,218 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] gtrecord.CubeSegmentScanner:57 : Init CubeSegmentScanner for segment 29542_29543
2017-05-15 14:59:34,219 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] dict.DictionaryManager:420 : DictionaryManager(1833236614) loading DictionaryInfo(loadDictObj:true) at /dict/DEFAULT.TEST/LV2_ID/7be3d6b1-0ed4-4b2d-9015-0c92b0757725.dict
2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:139 : Serialized scanRequestBytes 306 bytes, rawScanBytesString 40 bytes
2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseEndpointRPC:141 : The scan 37ca24c9 for segment TEST1ROWS1DIM[29542_29543] is as below with 1 separate raw scans, shard part of start/end key is set to 0
2017-05-15 14:59:34,221 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] v2.CubeHBaseRPC:278 : Visiting hbase table KYLIN_EBM57UYZMQ: cuboid require post aggregation, from 0 to 1
Start: \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00)
Stop:  \x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00 (\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xFF\x00), No Fuzzy Key
2017-05-15 14:59:34,256 INFO [kylin-coproc--pool14-t13] v2.CubeHBaseEndpointRPC:200 : <sub-thread for Query 54ac350a-7676-4601-a109-04dec543e56e GTScanRequest 37ca24c9>Endpoint RPC returned from HTable KYLIN_EBM57UYZMQ Shard \x4B\x59\x4C\x49\x4E\x5F\x45\x42\x4D\x35\x37\x55\x59\x5A\x4D\x51\x2C\x2C\x31\x34\x39\x34\x35\x37\x33\x38\x39\x30\x38\x33\x35\x2E\x37\x33\x35\x38\x33\x62\x36\x38\x30\x39\x62\x65\x39\x61\x37\x37\x66\x61\x32\x64\x65\x65\x62\x63\x62\x34\x35\x31\x39\x38\x64\x30\x2E on host: W168PC03. Total scanned row: 1. Total filtered/aggred row: 0. Time elapsed in EP: 2(ms). Server CPU usage: 0.07331964572432702, server physical mem left: 8.619999232E10, server swap mem left: 0.0. Etc message: start latency: 33@1,agg done@1,compress done@1,server stats done@2, debugGitTag:0fca07178088320fda44513c63a1134b7d39427c;. Normal Complete: true.
2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:576 : Scan count for each storageContext: 1,
2017-05-15 14:59:34,259 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:359 : Stats of SQL response: isException: false, duration: 102, total scan count 1
2017-05-15 14:59:34,260 INFO [Query 54ac350a-7676-4601-a109-04dec543e56e-108] service.QueryService:284 :
==========================[QUERY]===============================
Query Id: 54ac350a-7676-4601-a109-04dec543e56e
SQL: select sum(v) from test group by lv2_id LIMIT 50000
User: ADMIN
Success: true
Duration: 0.102
Project: TEST
Realization Names: [TEST1ROWS1DIM]
Cuboid Ids: [1]
Total scan count: 1
Result row count: 1
Accept Partial: true
Is Partial Result: false
Hit Exception Cache: false
Storage cache used: false
Message: null
==========================[QUERY]===============================

Cube definition:

{
  "uuid": "4a7fae8c-b518-4828-8f75-e6cd04689d75",
  "last_modified": 1494573765171,
  "version": "1.6.0",
  "name": "TEST1ROWS1DIM",
  "model_name": "TEST",
  "description": "",
  "null_string": null,
  "dimensions": [
    { "name": "LV2_ID", "table": "DEFAULT.TEST", "column": "LV2_ID", "derived": null },
    { "name": "MINUTE_START", "table": "DEFAULT.TEST", "column": "MINUTE_START", "derived": null }
  ],
  "measures": [
    {
      "name": "_COUNT_",
      "function": {
        "expression": "COUNT",
        "parameter": { "type": "constant", "value": "1", "next_parameter": null },
        "returntype": "bigint"
      },
      "dependent_measure_ref": null
    },
    {
      "name": "COUNT",
      "function": {
        "expression": "SUM",
        "parameter": { "type": "column", "value": "V", "next_parameter": null },
        "returntype": "bigint"
      },
      "dependent_measure_ref": null
    },
    {
      "name": "TOP10",
      "function": {
        "expression": "TOP_N",
        "parameter": {
          "type": "column",
          "value": "V",
          "next_parameter": { "type": "column", "value": "LV2_ID", "next_parameter": null }
        },
        "returntype": "topn(10)",
        "configuration": { "topn.encoding.LV2_ID": "dict" }
      },
      "dependent_measure_ref": null
    }
  ],
  "dictionaries": [],
  "rowkey": {
    "rowkey_columns": [
      { "column": "MINUTE_START", "encoding": "time", "isShardBy": false },
      { "column": "LV2_ID", "encoding": "dict", "isShardBy": false }
    ]
  },
  "hbase_mapping": {
    "column_family": [
      { "name": "F1", "columns": [ { "qualifier": "M", "measure_refs": [ "_COUNT_", "COUNT", "TOP10" ] } ] }
    ]
  },
  "aggregation_groups": [
    {
      "includes": [ "MINUTE_START", "LV2_ID" ],
      "select_rule": { "hierarchy_dims": [], "mandatory_dims": [], "joint_dims": [] }
    }
  ],
  "signature": "o75jqwjKPFF859EyU6j9eA==",
  "notify_list": [],
  "status_need_notify": [ "ERROR", "DISCARDED", "SUCCEED" ],
  "partition_date_start": 0,
  "partition_date_end": 3153600000000,
  "auto_merge_time_ranges": [ 604800000, 2419200000 ],
  "retention_range": 0,
  "engine_type": 2,
  "storage_type": 2,
  "override_kylin_properties": {}
}

________________________________
From: ShaoFeng Shi <[email protected]>
Sent: Monday, May 15, 2017 1:37 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Hi Tingmao,

Your finding is correct; Kylin didn't check whether there is
an "order by limit" pattern or not. The main reason is that the SQL parser didn't push down the "order by limit" info here (at least when TopN was developed), so I couldn't leverage it. Another reason is that usually we won't put the "group and order" column in as a normal cube dimension (though Kylin doesn't stop you from doing that). Please feel free to open a JIRA on checking the "order by limit" pattern; I can look at it later.

2017-05-15 2:37 GMT+08:00 Tingmao Lin <[email protected]>:

Yeah, we looked into the TopNCounter source code and found that in the low-cardinality scenario, m1=m2=0 in TopNCounter, so it just sums up the values. But the result still goes wrong. I will collect more information for investigation ASAP :)

I have a question: as Billy Liu said in this thread, Kylin will check the ORDER BY clause to determine whether to rewrite. But I didn't find any access to SQLDigest.sortColumns in TopNMeasureType.influenceCapabilityCheck(). Does Kylin check it elsewhere? It seems that if sum(measure) is the only measure in the query and the group by column matches, then TopNMeasureType.isTopNCompatibleSum() will pass and the query gets rewritten. This confuses the user, since they may expect an accurate result for every distinct value of the group by column(s).

________________________________
From: ShaoFeng Shi <[email protected]>
Sent: Sunday, May 14, 2017 2:13 PM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Agree with Yang's points. When the cardinality is smaller than the TopN counter's capacity, the result should be accurate. I checked the 1.6 source code and didn't find a clue. Please share more information (cube definition and query logs) for investigation, thanks!

For negative numbers in TopN, actually that isn't recommended, as it goes against TopN's purpose, which is counting occurrences of something.
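The hazard with negative values shows up in the counter merge step. The following is a minimal, illustrative Space-Saving-style merge, not Kylin's actual TopNCounter code: when one counter is full, keys it has not seen are credited with its smallest retained count, so a negative tail drags every merged estimate down:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a Space-Saving-style counter merge (illustrative
// only -- not Kylin's TopNCounter implementation).
public class TopNMergeSketch {
    // Merge 'other' into a copy of 'self'. If 'other' is at capacity,
    // keys it has not seen are credited with its smallest count, since
    // their true count in 'other' could be anywhere up to that bound.
    static Map<String, Double> merge(Map<String, Double> self,
                                     Map<String, Double> other,
                                     int capacity) {
        double floor = other.size() >= capacity
                ? Collections.min(other.values())
                : 0.0;
        Map<String, Double> merged = new HashMap<>(self);
        for (Map.Entry<String, Double> e : merged.entrySet()) {
            e.setValue(e.getValue() + other.getOrDefault(e.getKey(), floor));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> self = new HashMap<>();
        self.put("A", 5.0);
        Map<String, Double> other = new HashMap<>();
        other.put("B", -100.0); // full counter whose smallest count is negative
        // With a non-negative floor the estimate for "A" stays near 5.0;
        // the negative tail instead drags it down to -95.0.
        System.out.println(merge(self, other, 1));
    }
}
```

With non-negative counts this floor only ever over-estimates by a bounded amount; with a negative floor it actively subtracts from existing elements, which is the distortion described in this thread.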
When merging two TopN counters, one counter will use the other's last element's count to accumulate onto its own elements (as a supplement, if the other is full). If the last element is close to 0, its impact will be minor. But if the last element is a big negative value, its impact will be considerable! It doesn't make sense to reduce an existing element's count just because the last element's value is negative. So please use it properly in your scenario. Of course, I think Kylin should also add more checks there.

2017-05-14 17:18 GMT+08:00 Tingmao Lin <[email protected]>:

The SQL in the original email is exactly what we input in the "Insight" tab in the Kylin admin UI. I do not have access to the host running Kylin now, and I will post the detailed log output tomorrow. We reproduced the inaccurate-result behavior using a source table with <10 rows, and I should be able to write up the reproduction steps tomorrow.

________________________________
From: Billy Liu <[email protected]>
Sent: Sunday, May 14, 2017 12:21 AM
To: user
Subject: Re: Questions about SUM behavior when rewritten as TOPN

Thanks Tingmao for the report. Could you show us the complete SQL? In your SQL there is no ORDER BY statement, and without an ORDER BY the query should not be rewritten into the TopN measure.

2017-05-12 23:52 GMT+08:00 Tingmao Lin <[email protected]>:

Hi,

We found that a SUM() query on a cardinality-1 dimension is not accurate (or "not correct") when automatically rewritten as TOPN. Is that the expected behavior of Kylin, or is there some other issue?

We built a cube on a table (measure1: bigint, dim1_id: varchar, dim2_id: varchar, ...) using Kylin 1.6.0 (Kafka streaming source). The cube has two measures, SUM(measure1) and TOPN(10, sum-orderby(measure1), group by dim2_id) (other measures omitted), and two dimensions, dim1_id and dim2_id (other dims omitted).

About the source table data:
The cardinality of dim1_id is 1 (same dim1_id for all rows in the source table).
The cardinality of dim2_id is 1 (same dim2_id for all rows in the source table).
The possible values of measure1 are [1, 0, -1].

When we query "select SUM(measure1) FROM table GROUP BY dim2_id" => the result has one row: "sum=7". From the Kylin logs we found that the query has been automatically rewritten as TOPN(measure1, sum-orderby(measure1), group by dim2_id).

When we write another query to prevent the TOPN rewrite, for example:
"select SUM(measure1),count(*) FROM table GROUP BY dim2_id" => one row -- "sum=-2, count=24576"
"select SUM(measure1),count(*) FROM table" => one row -- "sum=-2, count=24576"

The result differs (7 vs. -2) depending on whether the query is rewritten to TOPN. My question is: is the following behavior working as expected, does the TOPN algorithm simply not support negative counter values well, or is there some other issue?

1. A SUM() query is automatically rewritten as TOPN and gives an approximate result even when no TOPN appears in the query.
2. When the cardinality is 1, TOPN does not give an accurate result.

Thanks.

--
Best regards,

Shaofeng Shi 史少锋
