dario-liberman opened a new issue, #10866:
URL: https://github.com/apache/pinot/issues/10866

   A common use-case in real-time interactive analytic applications is [funnel 
analysis](https://en.wikipedia.org/wiki/Funnel_analysis).
   
   Example funnel analytics products to name a few:
   - [Mixpanel](https://mixpanel.com/analysis)
   - [Amplitude](https://amplitude.com/amplitude-analytics)
   - 
[Datadog](https://docs.datadoghq.com/real_user_monitoring/funnel_analysis/) 
   - [Google Marketing Platform](https://marketingplatform.google.com/)
   
   At the core of these solutions is the ability to correlate user events 
across steps, such as going thru a checkout funnel.
   A typical metric is the count of users reaching each step in a sequential 
funnel, or the respective conversion rates between steps (simply the ratio of 
such counts from one step to the next).
   
   The proposal is to build an aggregation function able to correlate events by 
a given column, such as the user id, the session id, etc.
   
   In order to simplify the initial implementation, the proposal is to start 
with a causality based funnel rather than requiring a strict sequence in time. 
Most user funnels in real-life can only be exercised forward, so this can solve 
for the vast majority of use-cases.
   What is meant by causality is that we will count towards step N only if we 
counted towards step N-1 for the same correlation id, but we would accept for 
them to happen out of order (in fact, that is often the case in real-life 
distributed user event logging systems anyways).
   
   In order to simplify the solution further, we can expect that the table is 
partitioned by the correlation column (eg. user id).
   
   Once these assumptions are accepted, then the solution looks very similar to 
what is done in SEGMENT_PARTITIONED_DISTINCT_COUNT, which uses roaring bitmaps 
under the hood. The first step in the funnel would exactly match that, the 
second step in the funnel is simply the intersection between the first roaring 
bitmap and the second, and so on.
   
   The following aggregation function API is proposed:
   
   ```
   SELECT
      dateTrunc('day', timestamp) AS ts,
      FUNNEL_COUNT(
         STEPS(url = '/addToCart', url = '/checkout', url = 
'/orderConfirmation')
         CORRELATED_BY(user)
       ) as step_counts
   FROM user_log
   GROUP BY 1
   ```
   
   This query would return for each day the number of users going thru the 
funnel defined by the steps above as an array of longs, one count per step. An 
outer query or the client application (eg. a grafana plugin) can compute 
conversion rates if necessary.
   
   A user may optimise the query by filtering relevant steps as follows:
   ```
   SELECT
      dateTrunc('day', timestamp) AS ts,
      FUNNEL_COUNT(
         STEPS(url = '/addToCart', url = '/checkout', url = 
'/orderConfirmation')
         CORRELATED_BY(user_id)
       ) as step_counts
   FROM user_log
   WHERE url in ('/addToCart',  '/checkout', '/orderConfirmation')
   GROUP BY 1
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to