Hi
I'm struggling with the following issue.
I need to build a cube with 6 dimensions for app-usage data, for example:
-------+-------+------+-----+------+------
user | app | d3 | d4 | d5 | d6
-------+-------+------+-----+------+------
u1 | a1 | x | y | z | 5
-------+-------+------+-----+------+------
u2 | a1 | a | b | c | 6
-------+-------+------+-----+------+------
The dimension combinations generate ~100M rows daily.
For each row, I need to calculate the unique monthly, weekly, and daily
active users, along with some other metrics (which can simply be summed).
I can load the last 30 days of data each day and calculate a cube with
countDistinct('userId), but this requires a huge cluster and is quite
expensive.
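For context, the exact per-day pass is essentially `df.cube(dims).agg(countDistinct("userId"))` in Spark. A toy pure-Python sketch (hypothetical rows and column values, not your actual schema) of what that distinct count has to do per dimension combination, and why it is expensive at scale:

```python
from collections import defaultdict

# Hypothetical rows: (userId, app, d3, d4, d5, d6).
rows = [
    ("u1", "a1", "x", "y", "z", 5),
    ("u2", "a1", "a", "b", "c", 6),
    ("u1", "a1", "x", "y", "z", 5),  # same user/dims again, another event
    ("u3", "a1", "x", "y", "z", 5),
]

# Exact distinct count per dimension combination: keep a set of userIds
# per key.  This is what countDistinct must do — the per-key state (or the
# shuffle behind it) grows with the number of distinct users, which is why
# an exact 30-day cube over ~100M rows/day needs a large cluster.
users_per_combo = defaultdict(set)
for user, *dims in rows:
    users_per_combo[tuple(dims)].add(user)

dau = {combo: len(users) for combo, users in users_per_combo.items()}
```

The same structure applies to weekly and monthly windows; only the set of input rows changes.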
I tried HyperLogLog: store the byte array of the previous day's HLL,
deserialize it, add the current day's users, compute the new distinct
count, and serialize the byte array for the next day.
However, to get 5% error accuracy with HLL, the byte array has to be 4K
long, which makes the 100M rows ~4000 times bigger, and I ended up
requiring a lot more resources.
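A minimal toy HLL (not the library you used; register math per the standard HyperLogLog formulation) illustrating that deserialize/merge/estimate cycle. One note on sizing: with one-byte registers, 2^p registers give a standard error of roughly 1.04/sqrt(2^p), so a 4096-byte array should land near 1.6% error, while ~512 bytes is already close to 5% — depending on the implementation, the array may be shrinkable well below 4K:

```python
import hashlib
import math

class HLL:
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p              # 2^12 = 4096 one-byte registers (~the 4K array)
        self.registers = bytearray(self.m)

    def add(self, item):
        # 64-bit hash: top p bits pick a register, the rest yield the rank.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leftmost 1-bit position
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Union of two sketches is a register-wise max: this is the "add the
        # current day's users" step, without touching raw user ids.
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)    # small-range correction
        return est

    # The serialize/deserialize round-trip between days:
    def to_bytes(self):
        return bytes(self.registers)

    @classmethod
    def from_bytes(cls, data, p=12):
        hll = cls(p)
        hll.registers = bytearray(data)
        return hll

# Day 1: sketch 5000 users for one cube row, persist the byte array.
day1 = HLL()
for i in range(5000):
    day1.add(f"user{i}")
stored = day1.to_bytes()              # 4096 bytes per cube row

# Day 2: restore, fold in today's users (2000 overlap), re-estimate.
day2 = HLL.from_bytes(stored)
for i in range(3000, 8000):
    day2.add(f"user{i}")
print(round(day2.estimate()))         # ≈ 8000 distinct users over both days
```

If you stay on Spark, `approx_count_distinct` uses the same family of sketches internally, though it does not expose the intermediate byte array for day-over-day merging out of the box.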
I wonder if one of you can think of a better solution.
Thanks
Tal
--
*Tal Grynbaum* / *CTO & co-founder*
m# +972-54-7875797
mobile retention done right