dario-liberman opened a new issue, #10866: URL: https://github.com/apache/pinot/issues/10866
A common use-case in real-time interactive analytic applications is [funnel analysis](https://en.wikipedia.org/wiki/Funnel_analysis). Example funnel analytics products to name a few: - [Mixpanel](https://mixpanel.com/analysis) - [Amplitude](https://amplitude.com/amplitude-analytics) - [Datadog](https://docs.datadoghq.com/real_user_monitoring/funnel_analysis/) - [Google Marketing Platform](https://marketingplatform.google.com/) At the core of these solutions is the ability to correlate user events across steps, such as going thru a checkout funnel. A typical metric is the count of users reaching each step in a sequential funnel, or the respective conversion rates between steps (simply the ratio of such counts from one step to the next). The proposal is to build an aggregation function able to correlate events by a given column, such as the user id, the session id, etc. In order to simplify the initial implementation, the proposal is to start with a causality based funnel rather than requiring a strict sequence in time. Most user funnels in real-life can only be exercised forward, so this can solve for the vast majority of use-cases. What is meant by causality is that we will count towards step N only if we counted towards step N-1 for the same correlation id, but we would accept for them to happen out of order (in fact, that is often the case in real-life distributed user event logging systems anyways). In order to simplify the solution further, we can expect that the table is partitioned by the correlation column (eg. user id). Once these assumptions are accepted, then the solution looks very similar to what is done in SEGMENT_PARTITIONED_DISTINCT_COUNT, which uses roaring bitmaps under the hood. The first step in the funnel would exactly match that, the second step in the funnel is simply the intersection between the first roaring bitmap and the second, and so on. The following aggregation function API is proposed: ``` SELECT dateTrunc('day', timestamp) AS ts, FUNNEL_COUNT( STEPS(url = '/addToCart', url = '/checkout', url = '/orderConfirmation') CORRELATED_BY(user) ) as step_counts FROM user_log GROUP BY 1 ``` This query would return for each day the number of users going thru the funnel defined by the steps above as an array of longs, one count per step. An outer query or the client application (eg. a grafana plugin) can compute conversion rates if necessary. A user may optimise the query by filtering relevant steps as follows: ``` SELECT dateTrunc('day', timestamp) AS ts, FUNNEL_COUNT( STEPS(url = '/addToCart', url = '/checkout', url = '/orderConfirmation') CORRELATED_BY(user_id) ) as step_counts FROM user_log WHERE url in ('/addToCart', '/checkout', '/orderConfirmation') GROUP BY 1 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
