Re: [DISCUSSION] Don't need to purge existing segment of cube to add new measures in Kylin

yuzhang Mon, 22 Apr 2019 17:56:25 -0700

Hi Shaofeng:
    We also ran some experiments on adding a measure after the cube had been
built and hit a byte error at the very start. The default mapping strategy
between the HBase store and the measure definitions is "multiple measures are
stored in one column of column family", which can cause a byte error once a
measure is added and inserted into the original measure sequence. Adding a
separate column for the new measure may be better, I think.
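
To illustrate why inserting a measure into a packed sequence can break old
data, here is a minimal Python sketch. It assumes a simplified fixed-width
layout (one signed 64-bit integer per measure), which is not Kylin's actual
encoding; the function names and the measures count/max/sum are hypothetical.

```python
import struct

# Assumed layout (not Kylin's real codec): every measure is one signed
# 64-bit integer, and all measures of a row are packed into a single cell.
def encode_row(values):
    return struct.pack(f">{len(values)}q", *values)

def decode_row(cell, measure_names):
    # The reader derives the byte layout from the *current* measure list.
    expected = 8 * len(measure_names)
    if len(cell) != expected:
        raise ValueError(f"byte error: expected {expected} bytes, got {len(cell)}")
    values = struct.unpack(f">{len(measure_names)}q", cell)
    return dict(zip(measure_names, values))

# A segment built with the old definition [count, sum].
old_cell = encode_row([10, 250])

# Packed layout: inserting "max" into the sequence breaks decoding of old cells.
try:
    decode_row(old_cell, ["count", "max", "sum"])
    packed_ok = True
except ValueError:
    packed_ok = False

# One column per measure: old rows simply lack the new column.
old_row = {"count": encode_row([10]), "sum": encode_row([250])}

def decode_columns(row, measure_names):
    return {
        name: struct.unpack(">q", row[name])[0] if name in row else None
        for name in measure_names
    }

per_column = decode_columns(old_row, ["count", "max", "sum"])
```

With one column per measure, a missing column can be treated as "not built
yet" instead of corrupting the positions of the measures that follow it.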
    
    I just have a preliminary idea about the measure management design,
which may be impractical for now.
Dimensions and metrics are defined once the model is designed. Measures
aggregate the metrics over different dimensions to observe the data entities
represented by the model. All of this is the design of the 'logical view', I
think. The cube is a materialized view of this logical model; it is the bridge
between the logical view and the physical storage (and with it the highway is
set up). The life cycle of a measure may therefore depend on the model rather
than on the cube.

Based on this design, measure management can be set up once the model design
is complete. We can define measures on the model. Cubes under the model can
reuse those measures and build their segment data. When a SQL query arrives,
the Kylin query server needs to find a suitable model with suitable measures,
and then find an available cube.

Of course, such a design change would have a very large impact on the
existing Kylin architecture, and the query engine and metadata would change
substantially. So for now it remains on paper.
A more realistic, or transitional, design is to extend the metadata of the
measure. Just as CubeDesc defines the schema and a corresponding CubeInstance
manages the built segments, a MeasureDesc could also have a MeasureInstance
that manages the segments containing it.
I observed that Kylin's query service generates a GridTable to map between
logical views and HBase physical storage: Cuboid + Measure -> GridTable <-
HBase store. This GridTable is generated from CubeDesc, and the mapping is
performed for each segment. Therefore, at the mapping stage, the metadata can
tell which columns of the GridTable are not available in the current segment,
so the measure data can be read selectively at the RS (RegionServer) backend.
But its life cycle is the same as that of MeasureDesc, managed by CubeDesc.
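
The MeasureInstance idea can be sketched roughly as follows. This is only an
illustration in Python under stated assumptions: the class shape, the
plan_columns helper, and the segment names are all hypothetical and do not
exist in Kylin.

```python
from dataclasses import dataclass, field

@dataclass
class MeasureInstance:
    # Hypothetical metadata: the segments in which this measure has been built.
    name: str
    segments: set = field(default_factory=set)

def plan_columns(requested, instances, segment):
    """Split the requested measures into those readable from `segment`
    and those that the segment does not contain."""
    readable = [m for m in requested if segment in instances[m].segments]
    missing = [m for m in requested if segment not in instances[m].segments]
    return readable, missing

instances = {
    "m1": MeasureInstance("m1", {"seg_tr1", "seg_tr2"}),
    "m2": MeasureInstance("m2", {"seg_tr2"}),  # added after seg_tr1 was built
}

plan_tr1 = plan_columns(["m1", "m2"], instances, "seg_tr1")
plan_tr2 = plan_columns(["m1", "m2"], instances, "seg_tr2")
```

For the older segment only m1 would be requested from storage, and m2 would be
reported as missing rather than parsed from bytes that were never written.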

Regarding adding dimensions to the same cube, we also need to consider
aggregation groups and rowkey order. I am curious about how you
implemented it.

Best regards

yuzhang

Glad to see such a discussion. Supporting "schema change" in a friendly
way is what we should do in the next phase, as this requirement is
stronger than before.

Last week I also tried 1) adding a dimension after the cube has been built,
and 2) adding a measure after the cube has been built.

For 1) I have an idea, the first try was successful, and I want to
discuss it with the community some day.

2) failed: after a new measure was added, the query failed, and there was a
byte parsing error on the HBase RS side. I didn't continue with that.

Could you elaborate on your idea that "the measures of the analysis system can
be decoupled from the materialized view(cube) and have their own management
system"? Do you have a rough design for it? Thank you!

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]

On Sun, Apr 21, 2019 at 8:08 PM, yuzhang <[email protected]> wrote:

Hi JiaTao:
Maybe there should be an optional auto-complete mechanism across the
different measure views, shouldn't there?

yuzhang

The idea of supporting dynamically added measures in Kylin is impressive.

But in my opinion, once you add a measure, the existing segments should
also calculate the new measure (just add a new measure column). Users can
have many cubes, and a cube can have many segments; if the measure view
differs from segment to segment, it will increase the burden on the user.

Regards!

Aron Tao

On Sat, Apr 20, 2019 at 1:43 AM, yuzhang <[email protected]> wrote:

Hi dear Kylin users and development team:
Here are some things I want to discuss with the community.
As a representative MOLAP engine, Kylin uses pre-aggregation strategies
to provide high-concurrency analysis with second-level response times,
but it also loses some flexibility.
The limitation that existing segments must be purged before an additional
measure can be added causes a lot of duplicate computation and unnecessary
disk I/O. Such waste should be avoided, especially in a MOLAP engine.
For example, there is a cube, cubeA, with one measure m1 and segments over
time range 1 (tr1). Now the user adds a measure m2 but doesn't want to clear
the segments over tr1. The values of m2 will exist in tr2, the segments built
subsequently. Of course, tr1 doesn't contain values of m2, which will be
understood even by a user who knows little about MOLAP. Querying over tr1 and
tr2 is valid for both m1 and m2, but the result of m2 over tr1 will be null.
It would be better to remind the user that the measure is missing. Moreover,
refreshing will supply m2 to the segments over tr1.
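
The cubeA example can be made concrete with a small Python sketch. Everything
in it is assumed for illustration: the in-memory segments, the dimension
values "a" and "b", the numbers, and the query_sum helper are hypothetical and
say nothing about how Kylin actually executes queries.

```python
# Hypothetical pre-aggregated segments: dimension value -> measure values.
segments = {
    # tr1 was built before m2 existed, so its rows carry only m1.
    "tr1": {"a": {"m1": 5}, "b": {"m1": 7}},
    # tr2 was built after m2 was added.
    "tr2": {"a": {"m1": 3, "m2": 30}, "b": {"m1": 1, "m2": 10}},
}

def query_sum(measure, time_ranges):
    """Sum `measure` over the given time ranges and report the ranges
    whose segments do not contain it."""
    total, missing = None, []
    for tr in time_ranges:
        rows = segments[tr].values()
        if not any(measure in row for row in rows):
            missing.append(tr)  # remind the user instead of failing
            continue
        total = (total or 0) + sum(row[measure] for row in rows if measure in row)
    return total, missing

m1_all = query_sum("m1", ["tr1", "tr2"])
m2_tr1 = query_sum("m2", ["tr1"])
m2_all = query_sum("m2", ["tr1", "tr2"])
```

Here m2 over tr1 alone yields a null result plus a note that tr1 lacks the
measure, while a query over both ranges still returns the part that exists.
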
Currently, Kylin's storage engine uses HBase. Measures are aggregated
values based on combinations of dimension members and are stored in a
column of a column family in HBase. For the same cube, adding a new measure
would add a column to the HBase table (mapping) and would take effect in the
next build. For the existing HTables (segments), the new column is allowed
to be missing. Refreshing old segments will add a new column to their
HTables to store the new measure. The value of the new measure is aggregated
according to the combinations of dimension members in the rowkey, without
recalculating the existing measures.
Currently, for additional measures and even additional dimensions, Kylin's
solution is Hybrid, but we found the following shortcomings in use:
1. Management costs: similar cubes must be maintained repeatedly, and most
of them share many dimensions and indicators. If you want to perform
optimizations such as pruning, you need to configure all of these cubes.
2. A large number of cubes: in the early stage the business analysis is not
stable, and analysts often need to add measures. Cubes are added
continuously to the Hybrid group, which produces a lot of cubes.
3. Repeated calculation: if you want to drop the old cube in the Hybrid
group, you need to build the latest cube by computing the historical data
to cover the old cube.
These result in a lot of waste.
In addition, while applying Kylin I felt that the metadata about measures
was not ideal.
1. Measures are one of the most important concerns of analysts; if the
measures of the analysis system can be decoupled from the materialized
view(cube) and have their own management system, it may be more flexible.
2. Once the dimensions have been chosen during cube design, the cube's
cuboids are determined regardless of the number of measures. It may be
confusing to maintain cubes with different measures but the same cuboids.
Cubes with different cuboids should be considered different cubes, which is
the definition of a cube, isn't it?
These are just some thoughts about MOLAP from my use of Kylin. What do you
think about this? I sincerely look forward to your reply.
There may be some mistakes or misunderstandings here; please feel free to
correct me or discuss further if you find any.
Best regards
yuzhang
