Hi,

I am working on some large scale unique users analysis (think hundreds of
millions of records per day). Since number of all records per month goes
into many billions I am hoping that there may be some alternative to running
"SELECT DISTINCT user_unique_id..." such as sampling data or perhaps
deriving monthly numbers based on daily numbers of unique users with a use
of statistical linear regression or a similar technique.

I tried to use sampling but that does not seem to give me the results I
would expect. That is, for example if I sample every 1/100 record I would
expect to see about 1% of the total number of unique users from a given
period of time but instead I am seeing bigger numbers, for example something
like 5%. I am not sure if the law of large number applies to unique users
analysis.

Moreover it seems that "select distinct" does not scale linearly (which is
understandable) and so doing monthly unique users analysis takes way longer
than 30 x time to perform daily unique users analysis and requires a lot of
memory.

I was wondering if anyone had a similar challenge?

Many thanks,
Radek

Reply via email to