Hi Shaofeng:

We also ran some experiments on adding a measure after the cube has been built, and hit the byte error at the very start. The default mapping strategy between the HBase store and the measure definitions is "multiple measures are stored in one column of a column family", which can cause a byte error when a measure is added and inserted into the original measure sequence. Adding a new column for the new measure may be better, I think.

I only have a preliminary idea about the measure management design, and it may be impractical for now. Dimensions and metrics are defined once the model is designed. Measures aggregate the metrics over different dimensions to observe the data entities represented by the model. All of this belongs to the design of the 'logical view', I think. The cube is a materialized view of this logical model, and it is the bridge between the logical view and the physical storage (and the highway is set up). The life cycle of a measure may therefore depend on the model rather than on the cube.
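To illustrate the byte error described above, here is a minimal sketch. It does not use Kylin's real measure codec: the schemas, the measure names (`count`, `max_qty`, `sum_price`) and the helper functions are all hypothetical, and Python's `struct` module stands in for the serializer. It only shows why a value packed under the old measure sequence can no longer be parsed once a measure is inserted into that sequence, and why one column per measure tolerates a missing measure.

```python
import struct

# Hypothetical, simplified encoding: NOT Kylin's real measure codec.
# Each measure is (name, struct format); all measures of a row are packed
# back to back into a single cell value.
OLD_SCHEMA = [("count", ">q"), ("sum_price", ">d")]
# A new measure inserted into the middle of the original sequence.
NEW_SCHEMA = [("count", ">q"), ("max_qty", ">i"), ("sum_price", ">d")]


def encode(schema, row):
    """Pack every measure of the row into one byte string."""
    return b"".join(struct.pack(fmt, row[name]) for name, fmt in schema)


def decode(schema, blob):
    """Unpack the measures in schema order from one byte string."""
    out, offset = {}, 0
    for name, fmt in schema:
        (out[name],) = struct.unpack_from(fmt, blob, offset)
        offset += struct.calcsize(fmt)
    return out


def decode_shared_column_fails(blob):
    """Return True if reading an old value with the new schema raises."""
    try:
        decode(NEW_SCHEMA, blob)
    except struct.error:
        return True
    return False


def decode_separate_columns(schema, cells):
    """One column per measure: a missing column simply yields None."""
    return {
        name: struct.unpack(fmt, cells[name])[0] if name in cells else None
        for name, fmt in schema
    }
```

With separate columns, an old row read under the new schema yields `None` for the added measure instead of a parsing failure, which is the behaviour the "add a column for the new measure" suggestion relies on.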
Based on this design, measure management can be set up after the model design is completed. We can define measures based on the model. Cubes under the model can reuse those measures and build their segment data. When a SQL query arrives, the Kylin query server needs to find a suitable model with a suitable measure, and then find an available cube. Of course, such a design change would have a very large impact on the existing Kylin architecture, and the query and metadata layers would change a lot. So it seems that it is still on paper.

A more realistic or transitional design is to extend the metadata of the measure. Just as CubeDesc defines the schema and a related CubeInstance manages the built segments, MeasureDesc could also have a MeasureInstance to manage the segments that contain it. I observed that Kylin's query service generates a GridTable for the mapping between logical views and HBase physical storage:

Cuboid + Measure -> Grid Table <- HBase store

This Grid Table is generated based on CubeDesc, and the mapping is performed for each segment. Therefore, in the mapping stage, the metadata makes it possible to know which columns of the Grid Table cannot be obtained from the current segment, so the measure data can be read selectively at the RS backend. But its life cycle is the same as that of MeasureDesc, managed by CubeDesc.

Regarding adding dimensions to the same cube, we also need to consider aggregation groups and rowkey order. I am curious about and interested in how you implemented it.

Best regards
yuzhang
shifengdefan...@163.com

On 4/22/2019 09:05, ShaoFeng Shi <shaofeng...@apache.org> wrote:

Hi Yuzhang,

Glad to see such a discussion. How to support "schema change" in a friendly way is what we should work on in the next phase, as we see this requirement is stronger than before.
Last week I also tried 1) adding a dimension after the cube has been built, and 2) adding a measure after the cube has been built. For 1) I have got an idea; the first try was successful, and I want to discuss it with the community some day. 2) failed: after a new measure was added, the query failed, and on the HBase RS side there was a byte parsing error. I didn't continue after that.

Could you elaborate on your idea that "the measures of the analysis system can be decoupled from the materialized view(cube) and have their own management system"? Have you got a rough design for it? Thank you!

Best regards,
Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofeng...@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscr...@kylin.apache.org
Join Kylin dev mail group: dev-subscr...@kylin.apache.org

yuzhang <shifengdefan...@163.com> wrote on Sunday, April 21, 2019 at 8:08 PM:

Hi JiaTao:

Maybe it's necessary to have an optional auto-complete mechanism among the different measure views, isn't it?

yuzhang
shifengdefan...@163.com

On 4/20/2019 11:38, JiaTao Tao <taojia...@gmail.com> wrote:

Hi,

The idea of supporting adding measures to Kylin dynamically is impressive. But in my opinion, once you add a measure, the existing segments should also calculate the new measure (just add a new measure column). Users can have many cubes, and a cube can have many segments; if the measure view is different in each segment, it will increase the burden on the user.

--
Regards!
Aron Tao

yuzhang <shifengdefan...@163.com> wrote on Saturday, April 20, 2019 at 1:43 AM:

Hi dear Kylin users and development team:

There are some things I want to discuss with the community. As a representative MOLAP engine, Kylin uses pre-aggregation strategies to provide high-concurrency, second-level-response analysis capabilities, but it also loses some flexibility.
The limitation that existing segments must be purged before an additional measure can be added causes a lot of repeated calculation and unnecessary disk IO. Such waste should be avoided, especially in a MOLAP engine.

For example, there is a cubeA with one measure m1 and segments over time range 1 (tr1). Now the user adds one measure m2, but doesn't want to clear the segments over tr1. The value of m2 will exist in tr2, the segments built subsequently. Sure, tr1 doesn't contain values of m2, which users who know a little about MOLAP will understand. Querying over tr1 and tr2 is valid for both m1 and m2, but the result of m2 over tr1 will be null. It would be better to remind the user that the measure is missing. Moreover, refreshing will supply m2 to the segments over tr1.

Currently, Kylin's storage engine uses HBase. Measures are aggregated values based on combinations of various dimension members and are stored in a column of a column family in HBase. For the same cube, adding a new measure will add a column to the HBase table (mapping) and will take effect in the next build. For the existing HTables (segments), the new column is allowed to be missing. Refreshing old existing segments will add a new column to their HTables to store the new measure. The value of the new measure is aggregated according to the combination of dimension members in the rowkey, without recalculating the existing measures.

Now, for additional measures and even additional dimensions, Kylin's current solution is Hybrid, but we found the following shortcomings during use:

1. Management costs: Repeated maintenance of similar cubes, most of which share many dimensions and indicators. If you want to perform optimization operations such as pruning, you need to configure all of these cubes.
2. A large number of cubes: In its early stage the business analysis is not stable, and analysts often need to add some measures. Cubes are added continuously to the Hybrid group, which produces a lot of cubes.
3.
Repeat calculation: If you want to drop the old cube in the Hybrid group, you need to build the latest cube by computing the historical data to cover the old cube.

These will result in a lot of waste.

In addition, I felt that the metadata about measures was not perfect while using Kylin.

1. As measures are one of the most important concerns of analysts, if the measures of the analysis system can be decoupled from the materialized view(cube) and have their own management system, it may be more flexible.
2. Once the dimensions have been chosen in cube design, its cuboids are determined regardless of the number of measures. It may be confusing to maintain cubes with different measures but the same cuboids. Only cubes with different cuboids should be considered different cubes, which is the definition of a cube, isn't it?

These are just some thoughts about MOLAP from my use of Kylin. What do you think about this? Looking forward to your reply, sincerely. Maybe there are some mistakes or misunderstandings here; please feel free to correct me or discuss further if you find any of them.

Best regards
yuzhang
shifengdefan...@163.com
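The cubeA example in the original proposal (m1 present in tr1 and tr2, m2 present only in tr2 until tr1 is refreshed) can be sketched roughly as follows. This is only an illustration of the proposed query semantics, not Kylin code: `Segment` and `query_sum` are hypothetical names, and a plain sum stands in for whatever aggregation the measure uses.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a built segment and its stored measures."""
    name: str
    measures: dict = field(default_factory=dict)  # measure name -> aggregated value


def query_sum(segments, measure):
    """Aggregate a measure across segments.

    Returns (total, missing): total is None when no queried segment holds
    the measure, and missing lists the segments that lack it so the user
    can be reminded that the result is incomplete.
    """
    present = [s.measures[measure] for s in segments if measure in s.measures]
    missing = [s.name for s in segments if measure not in s.measures]
    total = sum(present) if present else None
    return total, missing


# tr1 was built before m2 existed; tr2 was built afterwards.
tr1 = Segment("tr1", {"m1": 10})
tr2 = Segment("tr2", {"m1": 5, "m2": 7})
```

Querying m2 over tr1 alone gives a null result plus a note that tr1 lacks the measure; refreshing tr1 (here, simply adding `m2` to its `measures`) makes the measure complete again without touching m1.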