Re: Questions about SUM behavior when rewritten as TOPN

Li Yang, Sun, 14 May 2017 01:59:49 -0700

Em... this will be interesting to investigate. JIRA created.
https://issues.apache.org/jira/browse/KYLIN-2617


And sure, TOPN is an approximate algorithm and does not give precise
results. Nevertheless, cardinality 1 is a very special case; I think even
an approximate algorithm should give the correct result in that case.
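
To illustrate why cardinality 1 looks like a special case, here is a minimal Python sketch of a capacity-bounded counter in the spirit of the Space-Saving algorithm. It is not Kylin's TopN implementation: the class name BoundedTopN, its methods, and the sample rows are all made up for illustration. In this toy model, error can only enter when a tracked key is evicted, so a single distinct key is summed exactly, negative values included.

```python
from typing import Dict, List, Tuple


class BoundedTopN:
    """Toy capacity-bounded counter (hypothetical, not Kylin's code)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.counters: Dict[str, int] = {}

    def offer(self, key: str, value: int) -> None:
        if key in self.counters:
            # Key already tracked: plain addition, no approximation.
            self.counters[key] += value
        elif len(self.counters) < self.capacity:
            # Free slot: start tracking the key exactly.
            self.counters[key] = value
        else:
            # Full: evict the smallest counter and let the new key inherit
            # its value. This is the only step that introduces error.
            victim = min(self.counters, key=self.counters.get)
            inherited = self.counters.pop(victim)
            self.counters[key] = inherited + value

    def top(self, n: int) -> List[Tuple[str, int]]:
        ranked = sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]


# Made-up rows: one distinct key, values drawn from {1, 0, -1}.
rows = [("d2", v) for v in [1, -1, 0, -1, 1, -1, -1, 0]]

sketch = BoundedTopN(capacity=10)
for key, value in rows:
    sketch.offer(key, value)

# With a single key nothing is ever evicted, so both numbers agree.
print(sketch.top(1), sum(v for _, v in rows))
```

With more distinct keys than capacity, the eviction branch runs and the counters become overestimates, which is the approximation one would normally expect from such a structure. Whether Kylin's merge or serialization path behaves differently for negative values is what the JIRA would need to establish.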



On Sun, May 14, 2017 at 8:21 AM, Billy Liu <[email protected]> wrote:

> Thanks Tingmao for the report.
>
> Could you show us the complete SQL? The SQL you posted has no ORDER BY
> clause. Without an ORDER BY, the query should not be rewritten into a TopN
> measure.
>
> 2017-05-12 23:52 GMT+08:00 Tingmao Lin <[email protected]>:
>
>> Hi,
>>
>> We found that a SUM() query on a cardinality-1 dimension is not accurate
>> (or "not correct") when it is automatically rewritten as TOPN.
>> Is that the expected behavior of Kylin, or is there some other issue?
>>
>> We built a cube on a table (measure1: bigint, dim1_id: varchar,
>> dim2_id: varchar, ...) using Kylin 1.6.0 (Kafka streaming source).
>>
>> The cube has two measures, SUM(measure1) and
>> TOPN(10,sum-orderby(measure1),group by dim2_id) (other measures
>> omitted), and two dimensions, dim1_id and dim2_id (other dimensions
>> omitted).
>>
>> About the source table data:
>> The cardinality of dim1_id is 1 (same dim1_id for all rows in the
>> source table).
>> The cardinality of dim2_id is 1 (same dim2_id for all rows in the
>> source table).
>> The possible values of measure1 are 1, 0, and -1.
>>
>> When we query
>>     "select SUM(measure1) FROM table GROUP BY dim2_id"
>>  =>     the result has one row: "sum=7".
>> From the Kylin logs we found that the query has been automatically
>> rewritten as TOPN(measure1,sum-orderby(measure1),group by dim2_id).
>>
>> When we write another query to prevent the TOPN rewrite, for example:
>>
>>    "select SUM(measure1),count(*) FROM table GROUP BY dim2_id"
>>              =>   one row -- "sum=-2,count=24576"
>>
>>    "select SUM(measure1),count(*) FROM table"
>>              =>   one row -- "sum=-2,count=24576"
>>
>>
>> The result differs (7 versus -2) depending on whether the query is
>> rewritten to TOPN.
>>
>>
>> My question is: do the following behaviors "work as expected", does the
>> TOPN algorithm not support negative counter values very well, or is there
>> some other issue?
>>
>>
>> 1. A SUM() query is automatically rewritten as TOPN and gives an
>> approximate result even though the query contains no TOPN.
>>
>> 2. When the cardinality is 1, TOPN does not give an accurate result.
>>
>>
>>
>>
>> Thanks.
>>
>>
>>
>>
>
